The notebook will run as-is without modification. See the README for more details about how to adjust data download and training time.
# General purpose
import pandas as pd
import numpy as np
import math
import git
import os
import shutil
import glob
import random
# For plotting
from matplotlib import pyplot as plt
import seaborn as sns
# Matplotlib converters
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
# For handling time/date series
from datetime import datetime, time, timedelta, date
# For data processing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
# For clustering analysis
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# Evaluation metrics
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import make_scorer
# Prediction modelling
import lightgbm
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.utils import shuffle
The aim of this project is to explore the measures taken by governments around the world to mitigate the effects of coronavirus.
The COVID-19 virus first emerged in China in late 2019. Since then it has spread around the globe, and the World Health Organisation declared it a pandemic on 11th March 2020. As of 8th March 2021, there had been over 110 million confirmed cases and over 2.5 million fatalities worldwide [10]. Most governments around the world have implemented measures to prevent the spread of the virus and minimise the number of fatalities, and there is evidence that these have saved many lives [17]. Although many of the measures taken are similar, such as social distancing, government approaches have varied in their exact type, strictness and timing. The impact of the virus in terms of the number of cases and fatalities has also varied between countries, which raises the question of whether some responses are more effective than others in controlling the virus. Identifying the most effective responses could save lives in any future widespread disease outbreak and may help in planning the lifting of measures as the crisis abates. This provides the motivation for this project.
This project will address the following specific questions:
How has the level of excess fatalities to date been affected by a) the speed at which different governments implemented measures and b) the extent and speed of virus testing programmes? (Excess fatality is defined as the difference between the death rate from all causes during the COVID-19 pandemic and the average death rate in the same time period over recent years.)
What combinations of measures are associated with a low level of excess fatalities during the pandemic? A K-means clustering model will be developed to explore this.
Can the progression of the virus in a country (in terms of numbers of cases) be predicted using data about the government responses? If so, is it possible to infer which responses are most important in influencing (and stopping) the spread of the virus? A regression model will be developed to explore this.
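The excess fatality definition in the first question can be illustrated with a toy calculation (all numbers below are hypothetical, not real data):

```python
# Toy illustration of the excess-fatality definition.
# All numbers are hypothetical, not taken from any real data source.
pandemic_weekly_deaths = [1200, 1350, 1500]   # deaths from all causes during the pandemic
baseline_weekly_deaths = [1000, 1050, 1020]   # average deaths in the same weeks of recent years

excess_deaths = [p - b for p, b in
                 zip(pandemic_weekly_deaths, baseline_weekly_deaths)]
print(excess_deaths)  # [200, 300, 480]
```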
This is a data science project conducted primarily for education and interest. The author makes no claim to be an expert in any branch of medicine or related fields. Individuals looking for advice on what they should do during the COVID-19 crisis should consult well-established, reliable information sources such as the World Health Organisation or one of the many government-backed health organisations, for example the UK Government [1].
This project does not aim to make any judgement on how "good" or "bad" a government's response to the COVID-19 crisis has been; rather, the aim is to find the effectiveness of government measures.
The author has absolutely no wish to trivialise this extremely serious situation. This is a project seeking to further our collective understanding of the virus to help minimise fatalities and bring the crisis to an end.
This project grew out of the final project for the Udacity Data Science course [2]. Some of the techniques used in this project were learned in that course, and some of my other projects in my GitHub repository also use them. Inspiration for the project came from Kaggle's COVID-19 challenge [3]. This project relies on several data sources (see section 2.1.) and the author gratefully acknowledges the work of The University of Oxford [4] and [5], Our World in Data [6] and [7], The Economist [8] and [9] and Johns Hopkins University [10] in making these available.
There are three main data sources.
A daily time series providing data on confirmed cases, fatalities, and a detailed breakdown of the measures implemented by governments. See [4] and [5].
A daily time series providing some data complementary to the first dataset, including numbers of cases and fatalities per head of population and testing data. See [6] and [7]. References to underlying data sources are given in [7].
A weekly time series providing information on excess fatalities, i.e. the number of additional deaths recorded from any cause compared with average mortality rates in previous years. Excess fatalities provide a good way to compare the outcomes of measures between countries, but the data is only available for some countries. See [8] and [9].
Johns Hopkins University also provides an excellent data set and dashboard [10]. This is not used directly in this project but provides some of the underlying data for several of the data sources.
The data can be gathered and updated from the GitHub repositories provided in 2.1. Data Sources.
The following code will attempt to determine if the data repositories have already been downloaded or not. If a local directory is found with the expected name defined in the variables named like *DATA_DIR below then that is assumed to be the local copy of the data and it will be used. If this directory is not found locally then the code will clone it from the remote and place it in the same directory as the notebook is running in. If you wish to clear the local copy of the data and check it out cleanly then please delete the appropriate directory manually. The directory names are defined in OXFORD_DATA_DIR, OWID_DATA_DIR and ECONOMIST_DATA_DIR.
The following code will check out the data at the latest SHA as of 30th November 2020. This is so the processing that follows is repeatable and won't be broken by future changes in the data set. This can of course be changed to a different SHA, or to the latest available version if you want to use more up-to-date data.
def setup_data(local_dir, remote_url):
""" Set up local data repositories.
If a local copy of the directory with the expected name is present then
this will be used as the local copy of the data repository.
If no local copy is found then clone it from the remote.
param local_dir: Relative path to local directory to hold data
param remote_url: Remote repository
return repo: Reference to the local copy of the repository
"""
if os.path.exists(local_dir):
print("Local {} found, using that as data source.".format(local_dir))
print("""If you would rather delete {0} and check it out again
cleanly, please delete {1} manually and rerun the notebook."""
.format(local_dir, local_dir))
print("")
repo = git.Repo(local_dir)
else:
print("No local copy of {} found, cloning from the remote.".format(
local_dir))
print("")
repo = git.Repo.clone_from(remote_url, local_dir)
return repo
# Local data directories and remote GitHub URLs
OXFORD_DATA_DIR = 'OxCGRT-covid-policy-tracker-data'
OXFORD_REMOTE_URL = 'https://github.com/OxCGRT/covid-policy-tracker.git'
OWID_DATA_DIR = 'owid-covid-19-data'
OWID_REMOTE_URL = 'https://github.com/owid/covid-19-data'
ECONOMIST_DATA_DIR = 'economist-covid-19-excess-deaths-tracker'
ECONOMIST_REMOTE_URL = \
'https://github.com/TheEconomist/covid-19-excess-deaths-tracker'
# Git SHAs for end of November 2020. This is to give a stable data source.
OXFORD_SHA_END_NOV_2020 = 'd3df75e6d'
OWID_SHA_END_NOV_2020 = 'fc58fad2'
ECONOMIST_SHA_END_NOV_2020 = 'eaf2d3e'
# Set up the repos
oxford_repo = setup_data(OXFORD_DATA_DIR, OXFORD_REMOTE_URL)
owid_repo = setup_data(OWID_DATA_DIR, OWID_REMOTE_URL)
economist_repo = setup_data(ECONOMIST_DATA_DIR, ECONOMIST_REMOTE_URL)
# Set up the data frames
oxford_repo.git.checkout(OXFORD_SHA_END_NOV_2020)
oxford_df = pd.read_csv(os.path.join(OXFORD_DATA_DIR,
'data/OxCGRT_latest.csv'),
dtype={'RegionName': str, 'RegionCode': str})
owid_repo.git.checkout(OWID_SHA_END_NOV_2020)
owid_df = pd.read_csv(os.path.join(OWID_DATA_DIR,
'public/data/owid-covid-data.csv'))
economist_repo.git.checkout(ECONOMIST_SHA_END_NOV_2020)
economist_excess_df = pd.read_csv(os.path.join(
ECONOMIST_DATA_DIR,
'output-data/excess-deaths/all_weekly_excess_deaths.csv'))
economist_monthly_excess_df = pd.read_csv(os.path.join(
ECONOMIST_DATA_DIR,
'output-data/excess-deaths/all_monthly_excess_deaths.csv'))
oxford_df.head()
oxford_df.info()
oxford_df.isnull().sum()
The oxford_df includes detailed information about the measures taken by governments along with case and fatality numbers from the European Centre for Disease Prevention and Control [13]. The data is a daily time series starting on 1st January 2020, updated daily and is organised by country. For some countries, it is additionally subdivided into regions (e.g. US states). There are many null values to deal with in the data.
owid_df.head()
owid_df.info()
owid_df.isnull().sum()
The owid_df includes several data fields complementary to the oxford_df, including case and fatality numbers per head of population and testing data. The data is a daily time series starting on 31st December 2019; it is updated daily and organised by country. There are many null values to deal with in the data.
Note that total_cases and total_deaths in this data frame should be equal to ConfirmedCases and ConfirmedDeaths in oxford_df (they use the same source information). However, total_cases and total_deaths have far fewer null values than ConfirmedCases and ConfirmedDeaths. See also section 3.2.1.2. for more about cross-checking the consistency of these fields and choosing which to use.
economist_excess_df.head()
economist_excess_df.info()
economist_excess_df.isnull().sum()
The most important data from the economist_excess_df is excess deaths. This is given for a limited number of places organised by country and region. It is a weekly time series. There are no null values to deal with.
economist_monthly_excess_df.info()
economist_monthly_excess_df.isnull().sum()
The economist_monthly_excess_df is a monthly time series with the same feature set as economist_excess_df. Again, it has no null values.
In this project the data will be examined on a country level so we can drop rows corresponding to regional (rather than country-wide) data. The features RegionName, RegionCode and Jurisdiction are not needed and will be dropped.
M1_Wildcard is also dropped here. This is because it is a free text notes field which can't be used easily in any automated processing.
# Drop rows containing regional (rather than country wide) data
oxford_df.drop(oxford_df[oxford_df['RegionName'].notnull()].index,
inplace=True)
# Drop features that are not needed
oxford_df.drop(columns=['RegionName', 'RegionCode',
'Jurisdiction', 'M1_Wildcard'], inplace=True)
All of the other data fields from this source will be retained for now.
There are many null values in oxford_df, and it appears that many (perhaps all) of these occur in the early stages of the pandemic when the levels of different measures were simply not recorded. It is reasonable to set these values to 0 (indicating no measures in place). If a null value occurs after a non-null value has been recorded, then assume the measurement is unchanged since the last non-null, i.e. replace the null value with the previous non-null value. This cleaning step will be done in section 3.1.1.3. in common with owid_df.
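The same idea can be sketched with pandas built-ins on a toy series (a minimal illustration only; the notebook uses its own fix_nulls function below): forward-fill interior gaps, then treat any remaining leading nulls as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical measure series: nulls before recording started,
# plus one interior gap after recording began.
s = pd.Series([np.nan, np.nan, 1.0, np.nan, 2.0])

# Forward-fill interior gaps, then set leading nulls ("no measures yet") to 0
fixed = s.ffill().fillna(0)
print(fixed.tolist())  # [0.0, 0.0, 1.0, 1.0, 2.0]
```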
This project will use a subset of the data in owid_df, and some fields are duplicated by oxford_df, so drop some fields now.
owid_df.drop(columns=['continent', 'stringency_index'], inplace=True)
For the remaining values:
If a null value occurs after a non-null value has been recorded, then assume the measurement is unchanged since the last non-null, i.e. replace the null value with the previous non-null value. (Same process as in oxford_df.)
However, for the data in owid_df, it is not always safe to assume that all null values at the start of the pandemic are 0. For example, the number of tests carried out before the pandemic started will be 0, but the number of smokers, or the diabetes prevalence, cannot be assumed to start at 0.
To handle this, create two lists, each one corresponding to the different handling of the initial null values.
# Features that can be safely assumed to be zero at the start of the series
owid_features_set_zero_at_start = ['total_cases',
'new_cases',
'total_deaths',
'new_deaths',
'total_cases_per_million',
'new_cases_per_million',
'total_deaths_per_million',
'new_deaths_per_million',
'total_tests',
'new_tests',
'total_tests_per_thousand',
'new_tests_per_thousand',
'new_tests_smoothed',
'new_tests_smoothed_per_thousand',
'hosp_patients',
'hosp_patients_per_million',
'icu_patients',
'icu_patients_per_million',
'positive_rate',
'reproduction_rate',
'tests_per_case',
'weekly_hosp_admissions',
'weekly_hosp_admissions_per_million',
'weekly_icu_admissions',
'weekly_icu_admissions_per_million',
'new_cases_smoothed',
'new_cases_smoothed_per_million',
'new_deaths_smoothed',
'new_deaths_smoothed_per_million']
# Features that can't be assumed to be zero at the start of the series
owid_features_not_set_zero_at_start = np.setdiff1d(
owid_df.columns, owid_features_set_zero_at_start)
owid_features_not_set_zero_at_start
Remove location from the list, because this will be renamed later, and there are no null values for location in the data set to be fixed anyway.
owid_features_not_set_zero_at_start = (
owid_features_not_set_zero_at_start.tolist())
owid_features_not_set_zero_at_start.remove('location')
owid_features_not_set_zero_at_start
Create a common function to fix null values in oxford_df and owid_df.
def fix_nulls(df, feature, assume_zeros_at_start):
""" Fix null values for a feature in a dataframe.
If assume_zeros_at_start is True then null values between the first
date in the dataframe and the first non-null value should be set to 0.
If False, they should remain as null values.
param df: dataframe
param feature: feature in the dataframe containing nulls
param assume_zeros_at_start: whether to replace nulls at the start with 0
"""
country_groups = df.groupby(['country'])
feature_list = []
for c in country_groups.groups.keys():
feature_list = country_groups.get_group(c)[feature].tolist()
corrected = False
for i in range(len(feature_list)):
# Fill in null values
if (pd.isnull(feature_list[i])):
# Set first data point in series to 0
if (i == 0) and (assume_zeros_at_start is True):
feature_list[i] = 0
corrected = True
# Subsequent null values are set equal to previous value.
elif i > 0:
feature_list[i] = feature_list[i-1]
corrected = True
if corrected is True:
print("Filling null values for location {0}".format(c))
df.loc[df['country'] == c, feature] = feature_list
Before using this to fix null values in oxford_df and owid_df, rename the CountryName field in oxford_df and location in owid_df to a common name country:
owid_df.rename(columns={'location': 'country'}, inplace=True)
oxford_df.rename(columns={'CountryName': 'country'}, inplace=True)
oxford_df.isnull().sum().sum()
for feature in oxford_df.columns:
fix_nulls(oxford_df, feature, True)
oxford_df.isnull().sum().sum()
Now fix the null values we can in owid_df:
owid_df.isnull().sum().sum()
for feature in owid_features_set_zero_at_start:
fix_nulls(owid_df, feature, True)
owid_df.isnull().sum().sum()
for feature in owid_features_not_set_zero_at_start:
fix_nulls(owid_df, feature, False)
owid_df.isnull().sum().sum()
All the nulls in oxford_df are fixed. There are still some nulls in owid_df, but some of these will not be needed, and some will be corrected later.
It's also necessary to correct simple data errors. This step will be easier after merging oxford_df and owid_df, so it will be done later (section 3.2.1.2.).
To give a consistent week numbering scheme (necessary for the merge step which will come later) convert start_date and end_date to type datetime, create a new field calendar_week and drop the week feature:
economist_excess_df['start_date'] = pd.to_datetime(
economist_excess_df['start_date'], format='%Y-%m-%d')
economist_excess_df['end_date'] = pd.to_datetime(
economist_excess_df['end_date'], format='%Y-%m-%d')
economist_excess_df['calendar_week'] = economist_excess_df[
'start_date'].dt.isocalendar().week
economist_excess_df.drop(columns='week', inplace=True)
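One quirk of ISO week numbering is worth noting (shown with a small assumed example): dates at the very end of December can fall into week 52 or 53 of the ISO year, so calendar_week values alone do not uniquely order dates across a year boundary.

```python
import pandas as pd

# Assumed example dates, chosen to show the week-53 wrap-around in 2020
dates = pd.Series(pd.to_datetime(['2020-01-06', '2020-06-01', '2020-12-28']))
weeks = dates.dt.isocalendar().week
print(weeks.tolist())  # [2, 23, 53]
```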
The main feature of interest is excess_deaths. Many of the others can be dropped.
economist_excess_df includes data per country, and a breakdown by region. We are only interested in the country data. Therefore, drop rows where region != country.
economist_excess_df.drop(
    index=economist_excess_df[economist_excess_df['country'] !=
                              economist_excess_df['region']].index,
    inplace=True)
Finally, drop columns we don't need.
economist_excess_df.drop(columns=['region',
'region_code',
'population',
'start_date',
'end_date',
'year',
'total_deaths',
'covid_deaths',
'expected_deaths',
'non_covid_deaths',
'covid_deaths_per_100k'],
inplace=True)
economist_excess_df.isnull().sum()
No more nulls remain.
economist_excess_df['country'].unique()
This data set will be merged with others later on. One approximation will be made here, that is to rename 'Britain' to 'United Kingdom'. These are not the same entity but since the government measures taken in both are at least similar, it seems reasonable to make this approximation.
economist_excess_df.replace(to_replace="Britain",
value="United Kingdom",
inplace=True)
This data set will be less useful than the weekly time series but will add a few data points to the clustering analysis later on. A lot of the information in this data frame is regional, and since the analysis will be country-based, this is not so useful. The preprocessing treatment is similar to the weekly data frame above.
economist_monthly_excess_df.drop(
    index=economist_monthly_excess_df[
        economist_monthly_excess_df['country'] !=
        economist_monthly_excess_df['region']].index,
    inplace=True)
economist_monthly_excess_df.drop(columns=['region',
'region_code',
'population',
'start_date',
'end_date',
'year',
'total_deaths',
'covid_deaths',
'expected_deaths',
'non_covid_deaths',
'covid_deaths_per_100k',
'total_deaths_per_7_days',
'covid_deaths_per_7_days',
'expected_deaths_per_7_days',
'excess_deaths_per_7_days',
'non_covid_deaths_per_7_days',
'covid_deaths_per_100k_per_7_days',
'excess_deaths_per_100k_per_7_days'],
inplace=True)
economist_monthly_excess_df.isnull().sum()
economist_monthly_excess_df['country'].unique()
This adds just three countries to the data.
economist_monthly_excess_df.head()
It is useful to merge the data sets from Oxford and Our World in Data because they are both daily time series and they provide some complementary features.
np.setdiff1d(oxford_df['country'].unique(), owid_df['country'].unique())
np.setdiff1d(owid_df['country'].unique(), oxford_df['country'].unique())
Both data frames contain some unique country values. Most of these appear to be either regions of another country, or countries with relatively small numbers of cases, so rather than include these with incomplete data, it is reasonable to drop them from the data set. It would be ideal to do this when the oxford_df and owid_df are merged, using an inner join. However, the join needs to be performed on both country and date, and not all the countries in owid_df include the full date range from 1st January 2020, so this isn't possible. Instead, the merge and tidy-up will be performed in two steps:
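The effect of the join choice can be seen on a toy example (the frames below are hypothetical): an inner join keeps only country/date pairs present in both frames, while a left join keeps every row of the left frame and fills the gaps with nulls.

```python
import pandas as pd

# Hypothetical frames keyed on country and date
left = pd.DataFrame({'country': ['A', 'A', 'B'],
                     'date': [1, 2, 1],
                     'x': [10, 11, 12]})
right = pd.DataFrame({'country': ['A'],
                      'date': [1],
                      'y': [100]})

inner = left.merge(right, how='inner', on=['country', 'date'])
outer_left = left.merge(right, how='left', on=['country', 'date'])
print(len(inner), len(outer_left))  # 1 3
print(outer_left['y'].isnull().sum())  # 2 rows gain a null 'y'
```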
The merge will be performed on the date and country keys. First convert the Date field in oxford_df to type datetime and rename it to date to match owid_df.
oxford_df['Date'] = pd.to_datetime(oxford_df['Date'], format='%Y%m%d')
oxford_df.rename(columns={'Date': 'date'}, inplace=True)
owid_df['date'] = pd.to_datetime(owid_df['date'], format='%Y-%m-%d')
combined_daily_df = oxford_df.merge(owid_df,
how='left',
on=['country', 'date'])
# Cannot use 'inner' join because owid_df is missing some data values we need
Now drop countries which are unique to oxford_df:
to_drop = list(np.setdiff1d(oxford_df['country'].unique(),
owid_df['country'].unique()))
combined_daily_df.drop(combined_daily_df.loc[
combined_daily_df['country'].isin(to_drop)].index, inplace=True)
The excess deaths data only exists for a few countries and it's presented as a week-by-week (rather than day-by-day) time series, so it doesn't make sense to merge this into the combined_daily_df.
combined_daily_df.head()
Perhaps a little confusingly, the different data sources use different conventions for labelling their features (lower case and underscores from owid_df, such as total_cases, and a version of camel case from oxford_df, such as ConfirmedCases). Although it may not look good, this is not a big problem so the feature names will be retained as-is.
It's also necessary to do some basic error correction on the data. In particular, two of the fields in oxford_df and owid_df should match because they are from the same source. Specifically, the following fields should be equal:
ConfirmedCases == total_cases
ConfirmedDeaths == total_deaths
This can be checked:
deaths_differences_df = combined_daily_df[
(combined_daily_df['ConfirmedDeaths'] !=
combined_daily_df['total_deaths'])][['country',
'ConfirmedDeaths',
'total_deaths']]
deaths_differences_df['Difference'] = abs(
deaths_differences_df['ConfirmedDeaths'] -
deaths_differences_df['total_deaths'])
deaths_differences_df[deaths_differences_df['Difference'] ==
deaths_differences_df['Difference'].max()]
This shows that there are some differences.
We can make a similar check on the confirmed cases data:
cases_differences_df = combined_daily_df[
(combined_daily_df['ConfirmedCases'] !=
combined_daily_df['total_cases'])][['country',
'ConfirmedCases',
'total_cases']]
cases_differences_df['Difference'] = abs(
cases_differences_df['ConfirmedCases'] -
cases_differences_df['total_cases'])
cases_differences_df[cases_differences_df['Difference'] ==
cases_differences_df['Difference'].max()]
Again, there are some differences.
Both data sources are dynamic and updated daily. The differences could be due to some legitimate adjustments/corrections, or to a time lag in data reporting and import. It was also noted above that the owid_df had fewer null values for total_cases and total_deaths. Both data sets come from reputable sources and both are likely to be reliable, but since we have to choose one source or the other in order to perform the rest of the analysis, we will use the total_cases and total_deaths values from Our World in Data. Therefore, drop ConfirmedCases and ConfirmedDeaths, and also iso_code and CountryCode (not needed).
combined_daily_df.drop(columns=['ConfirmedCases',
'ConfirmedDeaths',
'iso_code',
'CountryCode'], inplace=True)
The merge has introduced some additional null values, in rows where oxford_df contains dates for a country that are not present in owid_df. We can explore these:
combined_daily_df[['new_deaths_per_million',
                   'new_cases_per_million',
                   'total_cases',
                   'total_cases_per_million',
                   'total_deaths',
                   'total_deaths_per_million',
                   'new_tests_smoothed',
                   'new_tests_smoothed_per_thousand']].isnull().sum()
Each feature has the same number of null values, so check if they are all null across the same rows:
for feature in ['total_cases', 'total_deaths', 'total_cases_per_million',
                'total_deaths_per_million', 'new_cases_per_million',
                'new_deaths_per_million', 'new_tests_smoothed',
                'new_tests_smoothed_per_thousand']:
    # Parentheses around the comparison are needed: & binds more tightly
    # than > in Python
    print(combined_daily_df[combined_daily_df[feature].isnull() &
                            (combined_daily_df['new_cases'] > 0)].shape[0])
So for all of these features, the null values occur only when new_cases == 0, i.e. before the first cases were recorded. All of these nulls can be replaced with 0.
combined_daily_df.loc[combined_daily_df['total_cases'].isnull(),
['new_cases',
'new_deaths',
'total_cases',
'total_cases_per_million',
'total_deaths',
'total_deaths_per_million',
'new_cases_per_million',
'new_deaths_per_million',
'new_tests_smoothed',
'new_tests_smoothed_per_thousand']] = 0
Some countries have null values for population.
combined_daily_df[combined_daily_df['population'].isnull()]
In some cases this can be fixed by assuming that any value of population for the country concerned can be used to fill the missing values (it seems safe to assume no significant change in population during the period of interest). For cases where this can't be done, the countries will be dropped.
countries_null_pop = set(combined_daily_df[
combined_daily_df['population'].isnull()]['country'])
for c in countries_null_pop:
pop = combined_daily_df[
combined_daily_df['country'] == c]['population'].max()
if math.isnan(pop):
combined_daily_df.drop(
index=combined_daily_df[combined_daily_df['country'] == c].index,
inplace=True)
else:
combined_daily_df.loc[(combined_daily_df['country'] == c) &
(combined_daily_df['population'].isnull()),
'population'] = pop
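The same fill-then-drop logic can be sketched more compactly with groupby/transform (a hedged alternative on a toy frame, not the notebook's code above): broadcast each country's maximum known population over its rows, then drop countries where nothing could be recovered.

```python
import numpy as np
import pandas as pd

# Toy frame: country A has a recoverable population value, country B has none
df = pd.DataFrame({'country': ['A', 'A', 'B', 'B'],
                   'population': [np.nan, 50.0, np.nan, np.nan]})

# Broadcast the per-country maximum over each country's rows...
df['population'] = df.groupby('country')['population'].transform('max')
# ...then drop countries where no population value could be recovered
df = df.dropna(subset=['population'])
print(df['country'].tolist())  # ['A', 'A']
```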
As a final check, look for data points where total_cases or total_deaths decrease with increasing time. This should never happen in reality but could exist in the data set as an error or as a correction to previous data errors. The following function attempts to make some automatic fixes to the data series when this happens.
def fix_errors(df, feature):
"""Detect and fix errors in the data.
Total cases and total deaths should not decrease. Check this condition
and where possible correct it.
param df: dataframe
param feature: feature to check and correct
"""
country_groups = df.groupby(['country'])
feature_list = []
for c in country_groups.groups.keys():
feature_list = country_groups.get_group(c)[feature].tolist()
corrected = False
for i in range(len(feature_list)-1):
# Detect a data error
if (feature_list[i] > feature_list[i+1]):
# Correct a one-off low data point
try:
if (feature_list[i] <= feature_list[i+2]):
print("Correcting low data point. "
"Replaced {0} with {1} for country {2}".format(
feature_list[i+1],
feature_list[i],
c))
feature_list[i+1] = feature_list[i]
corrected = True
# Correct a one-off high data point
else:
if (feature_list[i-1] <= feature_list[i+1]):
print("Correcting high data point. "
"Replaced {0} with {1} for country {2}"
.format(
feature_list[i],
feature_list[i-1],
c))
feature_list[i] = feature_list[i-1]
corrected = True
else:
print("Not able to correct an erroneous point "
"for country {0} automatically"
.format(c))
# Where there is no data point at i+2, i.e. i is penultimate
except IndexError:
print("Correcting penultimate data point. "
"Replaced {0} with {1} for country {2}"
.format(
feature_list[i+1],
feature_list[i],
c))
feature_list[i+1] = feature_list[i]
corrected = True
if corrected is True:
print("Correcting for country {0}".format(c))
df.loc[df['country'] == c, feature] = feature_list
fix_errors(combined_daily_df, 'total_cases')
fix_errors(combined_daily_df, 'total_deaths')
Only a small number of errors could not be corrected automatically, which is acceptable.
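As an aside, a cruder but fully vectorised way to enforce a non-decreasing cumulative series is cummax, which raises one-off dips to the running maximum. Unlike fix_errors above it never lowers a spuriously high point, so this is only a sketch of the idea, not an equivalent replacement:

```python
import pandas as pd

# Toy cumulative series with a one-off dip at position 2
s = pd.Series([1, 3, 2, 4])

# cummax raises the dip to the running maximum
print(s.cummax().tolist())  # [1, 3, 3, 4]
```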
After correcting total_deaths and total_cases, it's necessary to recalculate new_cases and new_deaths to be consistent with this.
country_groups = combined_daily_df.groupby(['country'])
for c in list(country_groups.groups.keys()):
country_group = country_groups.get_group(c)
# Change in confirmed cases
combined_daily_df.loc[(combined_daily_df['country'] == c),
'new_cases'] = country_group['total_cases'].diff()
combined_daily_df.loc[(combined_daily_df['country'] == c),
'new_deaths'] = country_group['total_deaths'].diff()
Using diff() above will introduce null values on the first row. Correct for these now, and any examples where new_cases and new_deaths are calculated as less than 0 (which is an error):
combined_daily_df.loc[
combined_daily_df['new_cases'].isnull(), 'new_cases'] = 0
combined_daily_df.loc[
combined_daily_df['new_deaths'].isnull(), 'new_deaths'] = 0
combined_daily_df.loc[combined_daily_df['new_cases'] < 0, 'new_cases'] = 0
combined_daily_df.loc[combined_daily_df['new_deaths'] < 0, 'new_deaths'] = 0
combined_daily_df.head()
First convert the daily time series data into a weekly time series. This will make it possible to merge it with the weekly time series for excess fatalities, which will enable the analysis in future sections.
Sort data so it is grouped on a per-country basis.
combined_daily_df = combined_daily_df.sort_values(by=['country', 'date'])
Create a field for the calendar week.
combined_daily_df['calendar_week'] = combined_daily_df[
'date'].dt.isocalendar().week
combined_daily_df.head()
combined_daily_df['tests_units'].unique()
tests_units is a categorical variable and it does have an impact on interpreting the numbers of tests carried out. A PCR test detects the presence of the virus in the body (i.e., whether someone is currently infected), but other tests can also be used (e.g., an antibody test, which can detect whether someone was previously infected and produced antibodies) [14]. The following shows how tests_units can be encoded as a categorical variable and also gives a note on interpreting each category:
Encoding like this allows this feature to be included in the weekly_df.
combined_daily_df.loc[combined_daily_df['tests_units'].isnull(),
'tests_units'] = 0
combined_daily_df.replace(to_replace=['units unclear',
'units unclear (incl. non-PCR)',
'tests performed',
'tests performed (incl. non-PCR)',
'samples tested',
'people tested',
'people tested (incl. non-PCR)'],
value=[0, 0, 1, 2, 3, 4, 5],
inplace=True)
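A note on the design choice above: a DataFrame-wide replace would also overwrite these strings if they appeared in any other column. A column-restricted mapping is a safer equivalent, sketched here on a toy series (the dictionary mirrors the encoding above; unmapped/null entries fall back to 0):

```python
import pandas as pd

# Encoding map mirroring the categories used above
tests_units_codes = {
    'units unclear': 0,
    'units unclear (incl. non-PCR)': 0,
    'tests performed': 1,
    'tests performed (incl. non-PCR)': 2,
    'samples tested': 3,
    'people tested': 4,
    'people tested (incl. non-PCR)': 5,
}

# Toy column including a missing value
s = pd.Series(['tests performed', None, 'people tested'])
encoded = s.map(tests_units_codes).fillna(0).astype(int)
print(encoded.tolist())  # [1, 0, 4]
```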
Now aggregate data from the daily time series into a weekly time series.
weekly_df = combined_daily_df.groupby(
['country',
'calendar_week']).agg({'C1_School closing': 'mean',
'C1_Flag': 'mean',
'C2_Workplace closing': 'mean',
'C2_Flag': 'mean',
'C3_Cancel public events': 'mean',
'C3_Flag': 'mean',
'C4_Restrictions on gatherings': 'mean',
'C4_Flag': 'mean',
'C5_Close public transport': 'mean',
'C5_Flag': 'mean',
'C6_Stay at home requirements': 'mean',
'C6_Flag': 'mean',
'C7_Restrictions on internal movement': 'mean',
'C7_Flag': 'mean',
'C8_International travel controls': 'mean',
'E1_Income support': 'mean',
'E1_Flag': 'mean',
'E2_Debt/contract relief': 'mean',
'E3_Fiscal measures': 'mean',
'E4_International support': 'mean',
'H1_Public information campaigns': 'mean',
'H1_Flag': 'mean',
'H2_Testing policy': 'mean',
'H3_Contact tracing': 'mean',
'H4_Emergency investment in healthcare': 'mean',
'H5_Investment in vaccines': 'mean',
'H6_Facial Coverings': 'mean',
'H6_Flag': 'mean',
'total_cases': 'max',
'new_cases': 'sum',
'total_deaths': 'max',
'new_deaths': 'sum',
'StringencyIndex': 'mean',
'ContainmentHealthIndex': 'mean',
'EconomicSupportIndex': 'mean',
'GovernmentResponseIndex': 'mean',
'total_tests_per_thousand': 'max',
'new_tests_smoothed_per_thousand': 'sum',
'new_tests_per_thousand': 'sum',
'tests_units': 'min'})
weekly_df.reset_index(inplace=True)
# Round the values of indicators so they retain defined whole number values
weekly_df = weekly_df.round(0)
weekly_df.head()
Calculate a WeekNumber where "Week 1" of the virus is defined as the first week in which at least 35 new fatalities due to COVID-19 were recorded. In section 3.3., a DayNumber consistent with this will be calculated for the daily time series, defined as the first day when at least 5 new fatalities were recorded (the same definition as used by Our World in Data).
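The same numbering convention can be expressed as a compact vectorized sketch on toy data (the helper name and toy frame here are hypothetical, not part of the notebook's pipeline): the first week meeting the threshold becomes Week 1, the week before it is Week 0, the count runs outward in both directions, and countries that never reach the threshold stay at 0.

```python
import numpy as np
import pandas as pd

def add_week_number(df, threshold=35):
    """Week 1 = first week with >= threshold new deaths; Week 0 is the
    week before it; countries never reaching the threshold stay at 0."""
    def per_country(g):
        hits = np.flatnonzero(g['new_deaths'].to_numpy() >= threshold)
        if hits.size == 0:
            return g.assign(WeekNumber=0)   # threshold never reached
        return g.assign(WeekNumber=np.arange(len(g)) - hits[0] + 1)
    return df.groupby('country', group_keys=False).apply(per_country)

demo = pd.DataFrame({'country': ['A'] * 4 + ['B'] * 2,
                     'new_deaths': [0, 10, 40, 50, 1, 2]})
add_week_number(demo)['WeekNumber'].tolist()  # [-1, 0, 1, 2, 0, 0]
```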
weekly_df['WeekNumber'] = 0
country_groups = weekly_df.groupby(['country'])
for c in list(country_groups.groups.keys()):
country_group = country_groups.get_group(c)
index_labels = country_group[
country_group['new_deaths'] >= 35].index.tolist()
if len(index_labels) > 0:
first_index_in_group = country_group.index.tolist()[0]
last_index_in_group = country_group.index.tolist()[-1]
# WeekNumber increasing after Week 0
for i in range(index_labels[0], last_index_in_group+1):
weekly_df.loc[i, 'WeekNumber'] = weekly_df.loc[i-1,
'WeekNumber'] + 1
# WeekNumber decreasing before Week 0
# index_labels[0]-1 is "Week 0" so don't change that one
for i in range(index_labels[0]-2, first_index_in_group-1, -1):
weekly_df.loc[i, 'WeekNumber'] = weekly_df.loc[i+1,
'WeekNumber'] - 1
weekly_df.head()
economist_monthly_excess_df
Peru has 10 months of data, but the others have 9. Drop month 10, so that all countries from this source have the same amount of data available.
economist_monthly_excess_df.drop(
economist_monthly_excess_df.loc[
economist_monthly_excess_df['month'] == 10].index, inplace=True)
Create a monthly date range.
# Create the end_date field - values will be filled in in the next step
economist_monthly_excess_df['end_date'] = pd.to_datetime('2020-01-01',
format='%Y-%m-%d')
# dates for month_ends
month_ends = pd.date_range(start='2019-12-01',
freq='M',
periods=economist_monthly_excess_df[
'month'].max()+1)
# Set the values for end_date to the month_ends
country_groups = economist_monthly_excess_df.groupby(['country'])
for c in list(country_groups.groups.keys()):
economist_monthly_excess_df.loc[
economist_monthly_excess_df['country'] == c,
'end_date'] = month_ends[1:]
Convert monthly time series to weekly time series.
for i, c in enumerate(list(economist_monthly_excess_df['country'].unique())):
# country_monthly_excess_df holds data for one country
country_monthly_excess_df = economist_monthly_excess_df[
economist_monthly_excess_df['country'] == c].copy()
# Insert a row to hold a "start date" - this is month_ends[0], which is
# the month end for the month immediately before the start of the series
country_monthly_excess_df.loc[-1] = [c, 0, 0, 0, 0, month_ends[0]]
# Increase the indices by 1, because we have inserted a row
country_monthly_excess_df.index = country_monthly_excess_df.index + 1
country_monthly_excess_df = country_monthly_excess_df.sort_index()
# Change the index to be end_date - this is the dates from month_ends
country_monthly_excess_df.set_index(
keys=country_monthly_excess_df['end_date'], inplace=True)
# Now convert to a weekly series and impute gaps in the data
country_monthly_excess_df = country_monthly_excess_df.asfreq(
freq='W',
method='bfill')
# Set 'dates' to the values of index and drop the old index
country_monthly_excess_df['dates'] = list(
country_monthly_excess_df.index.values)
country_monthly_excess_df.reset_index(drop=True, inplace=True)
# Insert calendar week
country_monthly_excess_df['calendar_week'] = country_monthly_excess_df[
'dates'].dt.isocalendar().week
# Divide monthly totals by number of weeks - assume uniformly distributed
month_groups = country_monthly_excess_df.groupby(['month'])
for feature in ['excess_deaths', 'excess_deaths_per_100k',
'excess_deaths_pct_change']:
for m in list(month_groups.groups.keys()):
month = month_groups.get_group(m)
country_monthly_excess_df.loc[country_monthly_excess_df[
'month'] == m, feature] = month[feature]/month.shape[0]
# Append country_monthly_excess_df for each country together.
# This builds a weekly df for all countries in economist_monthly_excess_df
if i == 0:
monthly_excess_df = country_monthly_excess_df
else:
monthly_excess_df = pd.concat([monthly_excess_df, country_monthly_excess_df])
# Drop some features which are no longer useful
monthly_excess_df.drop(columns=['month', 'end_date', 'dates'], inplace=True)
monthly_excess_df.head()
Since monthly_excess_df is now a weekly series, we can append it to the existing weekly series for excess deaths, economist_excess_df.
economist_excess_df = pd.concat([economist_excess_df, monthly_excess_df])
economist_excess_df.tail()
Excess deaths is a measure of the additional deaths taking place during the pandemic compared to the number of deaths expected (based on historical data) over the same time period due to all causes.
weekly_df.rename(columns={'Calendar_Week': 'calendar_week'}, inplace=True)
weekly_excess_deaths_df = weekly_df.merge(economist_excess_df,
how='right',
on=['country', 'calendar_week'])
weekly_excess_deaths_df.head()
weekly_excess_deaths_df.info()
In many countries, Saturdays and Sundays are not full working days, and as a result deaths (from any cause) that occur on a Saturday or Sunday may not be formally registered until the start of the working week on Monday. This can lead to a 7-day periodicity in reported COVID-19 fatalities. For this reason, it is useful to look at a 7-day rolling average of both the number of cases and number of fatalities.
In addition, the day by day change in the 7-day rolling average is also a useful measure for modelling (see section 6).
Therefore, calculate the 7-day rolling average and change in the 7-day rolling average for both the number of confirmed cases and fatalities (7DANewCases, 7DADeltaNewCases, 7DANewFatalities, 7DADeltaNewFatalities).
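As a compact alternative to the per-country loops used in this notebook, the same rolling statistics can be computed with `groupby().transform`; a sketch on toy data (values hypothetical, column names matching the notebook's):

```python
import pandas as pd

# One country, monotonically rising case counts, for illustration only
df = pd.DataFrame({'country': ['A'] * 10,
                   'new_cases': range(10)})
# 7-day rolling average, computed independently within each country
df['7DANewCases'] = (df.groupby('country')['new_cases']
                       .transform(lambda s: s.rolling(7, min_periods=1).mean()))
# Day-on-day change in the rolling average, with the leading NaN zeroed
df['7DADeltaNewCases'] = (df.groupby('country')['7DANewCases']
                            .diff().fillna(0))
```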
combined_daily_df['7DANewCases'] = 0
combined_daily_df['7DADeltaNewCases'] = 0
# 7 day rolling average of new confirmed cases
country_groups = combined_daily_df.groupby(['country'])
for c in list(country_groups.groups.keys()):
country_group = country_groups.get_group(c)
# Rolling average of new confirmed cases for this country
combined_daily_df.loc[(combined_daily_df['country'] == c),
'7DANewCases'] = (country_group['new_cases'].
rolling(7, min_periods=1).mean())
combined_daily_df.loc[combined_daily_df['7DANewCases'].isnull(),
'7DANewCases'] = 0
country_groups = combined_daily_df.groupby(['country'])
for c in list(country_groups.groups.keys()):
country_group = country_groups.get_group(c)
# Change in confirmed cases
combined_daily_df.loc[(combined_daily_df['country'] == c),
'7DADeltaNewCases'] = country_group['7DANewCases'
].diff()
combined_daily_df.loc[combined_daily_df['7DADeltaNewCases'].isnull(),
'7DADeltaNewCases'] = 0
Next perform calculations on the number of fatalities.
combined_daily_df['7DANewFatalities'] = 0
combined_daily_df['7DADeltaNewFatalities'] = 0
# 7 day rolling average of new fatalities
country_groups = combined_daily_df.groupby(['country'])
for c in list(country_groups.groups.keys()):
country_group = country_groups.get_group(c)
# Rolling average of new fatalities for this country
combined_daily_df.loc[(combined_daily_df['country'] == c),
'7DANewFatalities'] = (
country_group['new_deaths'].rolling(7, min_periods=1).mean())
combined_daily_df.loc[combined_daily_df['7DANewFatalities'].isnull(),
'7DANewFatalities'] = 0
country_groups = combined_daily_df.groupby(['country'])
for c in list(country_groups.groups.keys()):
country_group = country_groups.get_group(c)
# Change in fatalities
combined_daily_df.loc[(combined_daily_df['country'] == c),
'7DADeltaNewFatalities'] = country_group[
'7DANewFatalities'].diff()
combined_daily_df.loc[combined_daily_df['7DADeltaNewFatalities'].isnull(),
'7DADeltaNewFatalities'] = 0
For comparisons between different countries, identify when "Day 1" of the virus falls for each location. This is defined as the first day when at least 5 new fatalities were recorded (the same definition as used by Our World in Data).
combined_daily_df['DayNumber'] = 0
country_groups = combined_daily_df.groupby(['country'])
for c in list(country_groups.groups.keys()):
country_group = country_groups.get_group(c)
index_labels = country_group[country_group['new_deaths']
>= 5].index.tolist()
if len(index_labels) > 0:
first_index_in_group = country_group.index.tolist()[0]
last_index_in_group = country_group.index.tolist()[-1]
# DayNumber increasing after Day 0
for i in range(index_labels[0], last_index_in_group+1):
combined_daily_df.loc[i, 'DayNumber'] = combined_daily_df.loc[
i-1, 'DayNumber'] + 1
# DayNumber decreasing before Day 0
# index_labels[0]-1 is "Day 0" so don't change that one
for i in range(index_labels[0]-2, first_index_in_group-1, -1):
combined_daily_df.loc[i, 'DayNumber'] = combined_daily_df.loc[
i+1, 'DayNumber'] - 1
combined_daily_df.head()
Finally, there are two dataframes ready for use:
- combined_daily_df: The daily time series including the government measures along with case, mortality and testing data
- weekly_excess_deaths_df: The weekly time series derived from the daily time series above and merged with the excess fatalities data

This section compares the numbers of confirmed cases and fatalities across different countries. The numbers per million of population are used rather than absolute numbers to allow comparison between countries with different populations.
Graph 1 below compares the case numbers for the 40 countries with the highest confirmed cases per million.
# Use totals one week ago because data is sometimes corrected retrospectively
one_week_ago = combined_daily_df['date'].max() - timedelta(7)
total_cases_df = combined_daily_df[
combined_daily_df['date'] == one_week_ago].sort_values(
by='total_cases_per_million', ascending=False).head(40)
# Label to hold references to graphs throughout the notebook
graphnum = 1
plt.figure(figsize=(18, 9))
plt.bar(total_cases_df['country'], total_cases_df['total_cases_per_million'])
plt.title("Graph {} - Total Cases per Million "
"Across Different Countries (Top 40)".format(graphnum))
plt.ylabel("Confirmed cases per million")
plt.xticks(rotation='vertical')
plt.show()
graphnum += 1
Some countries (for example, Andorra) appear as particularly high in this graph partly because they have a small population. This is not true of all the countries in the top 40 though - the USA, Brazil, European countries and others in the top 40 have large populations.
Now plot fatalities per million, maintaining the same sort order as in the graph above.
plt.figure(figsize=(18, 9))
plt.bar(total_cases_df['country'], total_cases_df['total_deaths_per_million'])
plt.title("Graph {} - Total Fatalities per Million "
"Across Different Countries".format(graphnum))
plt.ylabel("Fatalities per million")
plt.xticks(rotation='vertical')
plt.show()
graphnum += 1
The distribution of cases per million and fatalities per million across countries is very different. This shows that the fatality rate (i.e., the number of fatalities per confirmed case) differs a lot between countries.
Possible reasons are:
The last two points mean that the cases and fatalities data is probably inaccurate for some countries.
It was noted above that the numbers of COVID-19 cases and fatalities reported by countries may not always be accurate. Even a country with a comprehensive testing system and well organised healthcare is likely to make some mistakes, especially in a crisis situation.
An alternative way to measure fatalities is excess deaths. This is defined as the difference between the death rate from all causes during the COVID-19 pandemic and the average death rate in the same time period over recent years.
This measure is useful because it is unaffected even if a country does not correctly record all true COVID-19 deaths. It also includes deaths that could be indirectly attributed to COVID-19, for example when patients are unable to access healthcare for a non-COVID issue due to overload on the health services.
It is potentially a better way to compare fatalities between countries during the crisis. More information can be found in [8].
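A toy illustration of the definition (all numbers hypothetical): excess deaths are simply observed all-cause deaths minus the historical baseline for the same period.

```python
import pandas as pd

# Hypothetical weekly totals: observed all-cause deaths in 2020 vs a
# baseline such as the 2015-19 average for the same calendar weeks
weeks = pd.DataFrame({
    'deaths_2020': [1000, 1400, 2100, 1800],
    'baseline':    [1000, 1050, 1020,  990]})
weeks['excess_deaths'] = weeks['deaths_2020'] - weeks['baseline']
weeks['excess_deaths'].tolist()  # [0, 350, 1080, 810]
```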
The disadvantage of the excess deaths data is that it is only available for a small subset of countries, specifically:
print("Countries with excess death data:")
countries = weekly_excess_deaths_df.groupby(['country'])
list(countries.groups.keys())
The following sections will explore the excess deaths data.
First look at the way excess deaths varies over time for each of the countries with available data. Graph 3 shows the absolute numbers; Graph 4 shows the excess deaths per 100,000 of population, which provides the best comparison.
countries = weekly_excess_deaths_df.groupby(['country'])
# Set styles and colours using ColorBrewer palette
STYLES = ['solid', 'dashed']
NUM_STYLES = len(STYLES)
clrs = sns.color_palette('Set1')
NUM_COLOURS = len(clrs)
count = 0
plt.figure(figsize=(18, 9))
for c in list(countries.groups.keys()):
country = countries.get_group(c)
country_data = country[country['WeekNumber'] > -5]
plt.plot(country_data['WeekNumber'],
country_data['excess_deaths'],
label=c,
color=clrs[count % NUM_COLOURS],
linestyle=STYLES[count % NUM_STYLES])
count += 1
plt.xlabel('WeekNumber')
plt.ylabel('Excess Deaths')
plt.xticks(rotation='vertical')
plt.legend()
plt.grid()
plt.title("Graph {} - Graph to Compare Excess Deaths "
"Across Countries".format(graphnum))
plt.show()
graphnum += 1
count = 0
plt.figure(figsize=(18, 9))
for c in list(countries.groups.keys()):
country = countries.get_group(c)
country_data = country[country['WeekNumber'] > -5]
plt.plot(country_data['WeekNumber'],
country_data['excess_deaths_per_100k'],
label=c,
color=clrs[count % NUM_COLOURS],
linestyle=STYLES[count % NUM_STYLES])
count += 1
plt.xlabel('WeekNumber')
plt.ylabel('Excess Deaths per 100k')
plt.xticks(rotation='vertical')
plt.legend()
plt.grid()
plt.title("Graph {} - Graph to Compare Excess Deaths per 100k "
"of Population Across Countries".format(graphnum))
plt.show()
graphnum += 1
Looking at Graph 4 showing excess deaths per 100K:
The size of the first peak varies significantly, from less than 5 per 100k for Austria and Norway, to almost 25 per 100k for Spain. This shows that the impact of the virus differed significantly from country to country.
The time elapsed between recording the first five COVID-19 deaths and the peak excess deaths per 100k also varies significantly. For most countries the peak comes at around 3 or 4 weeks. For Chile, Peru, and South Africa the first peak comes at about 10 weeks.
For several countries, excess deaths peaked more than once. In particular, several new peaks appeared after week 30. Generally, the first peak was worse than later ones. This is likely to be because countries were much more prepared when subsequent peaks came.
Compare the total excess deaths for countries. Note that this is a snapshot in time - the virus arrived in different countries at different times and the crisis still continues, so this is a dynamic situation and these numbers will change before the COVID-19 crisis is over. To account for this difference, use the same number of weeks of data for each country, even if more is available.
# Find the latest week of data that all countries have
countries = weekly_excess_deaths_df.groupby(['country'])
country_list = (list(countries.groups.keys()))
latest_weeks = []
for c in country_list:
country = countries.get_group(c)
latest_weeks.append(country['WeekNumber'].max())
latest_week = min(latest_weeks)
countries = weekly_excess_deaths_df.groupby(['country'])
country_list = (list(countries.groups.keys()))
total_excess_deaths_per_100k = []
for c in country_list:
country = countries.get_group(c)
country_data = country[(country['WeekNumber'] > 0)
& (country['WeekNumber'] <= latest_week)]
total_excess_deaths_per_100k.append(
country_data['excess_deaths_per_100k'].sum())
d = {'Country': country_list,
'total_excess_deaths_per_100k': total_excess_deaths_per_100k}
total_excess_deaths_per_100k_df = pd.DataFrame(data=d)
total_excess_deaths_per_100k_df.sort_values(
by=['total_excess_deaths_per_100k'], axis=0, ascending=False, inplace=True)
plt.figure(figsize=(18, 9))
plt.bar(x=total_excess_deaths_per_100k_df['Country'],
height=total_excess_deaths_per_100k_df['total_excess_deaths_per_100k'])
plt.xlabel("Country")
plt.ylabel("Total Excess Deaths per 100k")
plt.xticks(rotation='vertical')
plt.title("Graph {} - Graph to Compare Total Excess Deaths "
"per 100k of Population Across Countries".format(graphnum))
plt.show()
graphnum += 1
Of the countries for which data exists, Peru, Ecuador and Mexico have the highest numbers of excess deaths. There is a large group with a middling level of excess deaths (50-100 per 100k), and a smaller group (Switzerland, Austria, Germany, Denmark, Norway) with relatively low excess deaths (<50 per 100k).
During the pandemic, different governments took various measures to inhibit the virus. The aim of this section is to explore how government measures varied over time, and how this might relate to the numbers of cases and fatalities. It's beyond the scope of this project to analyse the response of every country. Instead, this section focusses on a few example countries to demonstrate some different government responses with different results.
Two graphs will be shown for each country.
The first graph shows the 7 day rolling average of new cases and new fatalities. Absolute numbers are used (rather than per head of population) because this section focusses on the relationship between government measures and the cases and fatalities in an individual country, rather than comparisons between countries. The aim of government responses around the world has been to not only minimise total cases and (especially) fatalities, but also to broaden and flatten the peak of cases so that local health systems could cope with the influx of patients.
The second graph for each country shows the variation in government responses. This is done using the data from [4] which tracks a set of government measures on a day by day basis. These rank the strength of the government measures on a numerical scale, typically from 0 to 3, and indicates whether the measures are enforced nationally or locally. The authors of [4] use these individual measures to calculate four composite indices on a scale of 1 to 100. It is these four composite indices which are plotted in the graphs below.
The indices are fully defined in [4]. For convenience, the following summary is copied from the same source:
The data from the 17 indicators is aggregated into a set of four common indices, reporting a number between 1 and 100 to reflect the level of government action on the topics in question:
- an overall government response index (which records how the response of governments has varied over all indicators in the database, becoming stronger or weaker over the course of the outbreak);
- a containment and health index (which combines ‘lockdown’ restrictions and closures with measures such as testing policy and contact tracing, short term investment in healthcare, as well as investments in vaccines)
- an economic support index (which records measures such as income support and debt relief)
- as well as the original stringency index (which records the strictness of ‘lockdown style’ policies that primarily restrict people’s behaviour).
Note that these indices simply record the number and strictness of government policies, and should not be interpreted as ‘scoring’ the appropriateness or effectiveness of a country’s response. A higher position in an index does not necessarily mean that a country's response is ‘better’ than others lower on the index.
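To illustrate the general idea of a composite index, here is a deliberately simplified sketch: rescale each ordinal indicator to 0-100 and average. (The actual OxCGRT methodology in [4] covers many more indicators and also adjusts each sub-index for the national/local flag variables; the 0-3 scales below are assumed for illustration.)

```python
# Assumed 0-3 ordinal scales for three example indicators
MAX_VALUES = {'C1_School closing': 3,
              'C2_Workplace closing': 3,
              'C6_Stay at home requirements': 3}

def simple_index(day):
    """Average of each indicator rescaled to the 0-100 range."""
    scores = [100 * day[k] / v for k, v in MAX_VALUES.items()]
    return sum(scores) / len(scores)

day = {'C1_School closing': 3,
       'C2_Workplace closing': 2,
       'C6_Stay at home requirements': 1}
round(simple_index(day), 1)  # 66.7
```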
def graph_daily_measures_and_results(loc):
""" Plot graphs of new cases, new fatalities and gov measures vs time
Plotting real ground truth data from the cleaned data sources.
param loc: Location (country) to plot data for
"""
global graphnum
country_groups = combined_daily_df.groupby(['country'])
grp = country_groups.get_group(loc)
data = grp[grp['DayNumber'] > -5]
fig = plt.figure(figsize=(18, 5))
ax1 = fig.add_subplot(111)
ax1.plot(data['DayNumber'], data['7DANewCases'], color='blue')
ax1.set_ylabel('New Cases', color='blue')
ax2 = ax1.twinx()
ax2.plot(data['DayNumber'], data['7DANewFatalities'],
color='red')
ax2.set_ylabel("New Fatalities", color='red')
ax2.set_xlabel("DayNumber")
plt.xticks(rotation='vertical')
plt.title("Graph {0} - 7 Day Rolling Average New Fatalities "
"and New Confirmed Cases in {1}".format(graphnum, loc))
plt.show()
graphnum += 1
# Composite indices
plt.figure(figsize=(16, 5))
plt.plot(data['DayNumber'],
data['StringencyIndex'],
label="Stringency Index")
plt.plot(data['DayNumber'],
data['ContainmentHealthIndex'],
label="Containment and Health Index")
plt.plot(data['DayNumber'],
data['EconomicSupportIndex'],
label="Economic Support Index")
plt.plot(data['DayNumber'],
data['GovernmentResponseIndex'],
label="Government Response Index")
plt.xlabel("DayNumber")
plt.ylabel("Composite Index")
plt.legend()
plt.title("Graph {0} - Composite Indices Calculated "
"From Government Measures in {1}".format(graphnum, loc))
plt.show()
graphnum += 1
graph_daily_measures_and_results('United Kingdom')
The UK implemented a lockdown more slowly than some other European countries (see section 4.3.) and the level of excess deaths is high compared to other European countries. There is a clear double peak in cases and fatalities. The stringency index varied in the range 60 to 80. Increasing the stringency index during the early stages of the first peak seems to lead to a fall in cases and fatalities. It's not easy to interpret in detail how the government measures affected the second peak, but a couple of observations can be made:
graph_daily_measures_and_results('Germany')
Germany has so far suffered fewer cases and fewer fatalities than many of its European neighbours. The stringency index varied more than in the UK, starting at a higher level early in the first peak, and dropping to a lower minimum level. As in the UK, the number of cases detected in the second peak is higher than in the first peak, also probably due to increased testing. The number of fatalities in both peaks is similar.
graph_daily_measures_and_results('Sweden')
Sweden is interesting because it has maintained a low stringency index compared to other European countries. Like Germany and the UK, there is a double peak in cases and fatalities. The peak numbers of cases and fatalities are low. However, Sweden is a relatively sparsely populated country, and (as seen in section 4.1.1) both total fatalities per million and cases per million are relatively high, and excess deaths is around the lower-middle part of the range.
graph_daily_measures_and_results('Japan')
Japan is remarkable because, with the exception of economic support, it adopted relatively light control measures but still had low numbers of cases and fatalities per million. Three peaks are visible but these are all with low numbers. This demonstrates that government measures alone cannot explain the variation in the numbers of cases and fatalities.
graph_daily_measures_and_results('Australia')
Australia also went through two peaks but had relatively few fatalities and cases. Australia implemented strict controls (high stringency index) during both peaks and this probably contributed to the low numbers of cases and deaths. Another factor could be that Australia is an (admittedly giant) island with a fairly low population density; this may have helped their control measures to work well.
graph_daily_measures_and_results('United States')
The United States had a high level of excess deaths during the pandemic. They were one of the slowest countries to implement a lockdown (see section 4.3.). Stringency index varied between just over 60 and just over 70 for the majority of the time, but has recently risen to its maximum value just below 80. There are two clear peaks, but cases and fatalities between them never really fell to a low level, so the crisis has been more continuous than in some other countries.
The size of the country must also be considered a factor - many states are big enough in size and population to be countries in their own right.
graph_daily_measures_and_results('India')
In contrast to the countries analysed above, India is a developing nation, and that means it faces additional challenges during the pandemic. It also means that the data on infection and fatality rates may not be as reliable as for some countries, and data on excess deaths is not available. Unlike some countries, cases and fatalities in India rose to a broad single peak. The stringency index was initially very high but has generally fallen throughout the pandemic. Cases and fatalities have been falling over recent months, although the rate of decrease has slowed in recent weeks.
Comparing the patterns in the numbers of cases and fatalities and government responses between countries shows several features in common:
There are also some notable differences between countries:
The first peak in many countries receded after lockdown measures were introduced by governments. Although this doesn't prove that the measures suppress the virus, it is consistent with that interpretation. Our understanding of the virus is that it spreads when people are close to one another, so it makes sense that government measures preventing close contact do slow the spread of the virus.
The relationship between the measures and the spread of the virus is not simple, though. The graphs above demonstrate this: even countries which implemented similar levels of stringency had very different outcomes. It's likely that other factors such as economics, culture, geography, demographics and population compliance also affect the spread of the virus.
Most governments implemented a set of many measures. It is difficult to extract which particular measures, or combinations of measures, were most important in preventing the spread of the virus. Section 5 and 6 use machine learning methods to explore this further.
It's possible that it's not only the type and level of government responses which are important in controlling the spread of the virus but also the time at which they were implemented. This section will look for a relationship between the time taken to implement lockdown and the total number of excess deaths. Excess deaths is chosen as the measure rather than deaths or cases per million because it provides a better way to compare between countries (see section 4.1.2.).
First plot the variation of one of the composite measures (Stringency Index) over time for all the countries for which we have excess deaths data.
countries = weekly_excess_deaths_df.groupby(['country'])
# Set styles and colours using ColorBrewer palette
STYLES = ['solid', 'dashed']
NUM_STYLES = len(STYLES)
clrs = sns.color_palette('Set1')
NUM_COLOURS = len(clrs)
count = 0
plt.figure(figsize=(18, 9))
for c in list(countries.groups.keys()):
country = countries.get_group(c)
country_data = country[country['WeekNumber'] > -5]
plt.plot(country_data['WeekNumber'],
country_data['StringencyIndex'],
label=c,
color=clrs[count % NUM_COLOURS],
linestyle=STYLES[count % NUM_STYLES])
count += 1
plt.xlabel("WeekNumber")
plt.ylabel("Stringency Index")
plt.xticks(rotation='vertical')
plt.legend()
plt.grid()
plt.title("Graph {} - Graph to Compare Week by Week Variation "
"of Stringency Index Across Countries".format(graphnum))
plt.show()
graphnum += 1
Although there are variations in the nature of government responses, most countries seemed to raise stringency to a value above 70 (Sweden is a notable exception). However, the exact time when each country raised stringency to this level, relative to the time of the first confirmed fatalities, varied. We will arbitrarily define the time of lockdown as the time when the stringency index first reached 70 or more. Now we can look at the relationship between the total number of excess deaths and the time elapsed between DayNumber 1 (see section 3.3.) and the time of lockdown.
time_to_lockdown = []
countries = combined_daily_df.groupby(['country'])
country_list = list(total_excess_deaths_per_100k_df['Country'].unique())
for c in country_list:
country = countries.get_group(c)
if country[country['StringencyIndex'] >= 70].shape[0] > 0:
lockdown_day = country[
country['StringencyIndex'] >= 70]['DayNumber'].iloc[0]
time_to_lockdown.append(lockdown_day)
# If StringencyIndex never reaches 70, use the day of max StringencyIndex
else:
lockdown_day = country[country['StringencyIndex'] == country[
'StringencyIndex'].max()]['DayNumber'].iloc[0]
time_to_lockdown.append(lockdown_day)
total_excess_deaths_per_100k_df['time_to_lockdown'] = time_to_lockdown
plt.figure(figsize=(18, 9))
x = total_excess_deaths_per_100k_df['time_to_lockdown']
y = total_excess_deaths_per_100k_df['total_excess_deaths_per_100k']
plt.scatter(x, y)
for i, txt in enumerate(total_excess_deaths_per_100k_df['Country'].to_list()):
plt.annotate(txt, (x.iloc[i], y.iloc[i]))
plt.xlabel("Time to lockdown (days)")
plt.ylabel("Total Excess Deaths per 100k")
plt.xticks(rotation='vertical')
plt.grid()
plt.title("Graph {} - Graph to Show the Relationship Between "
"Time Taken to Implement Lockdown "
"and Total Excess Deaths per 100k of Population".format(graphnum))
plt.show()
graphnum += 1
There isn't any clear relationship between the time to lockdown and the number of excess deaths. To try and eliminate economic, cultural and other factors from the comparison, it's also worth comparing the European countries alone:
europe = [
'Austria',
'Denmark',
'Norway',
'Portugal',
'Belgium',
'Italy',
'Netherlands',
'Switzerland',
'Germany',
'France',
'United Kingdom',
'Spain',
'Sweden']
plt.figure(figsize=(18, 9))
europe_total_excess_deaths_per_100k_df = total_excess_deaths_per_100k_df[
total_excess_deaths_per_100k_df['Country'].isin(europe)]
x = europe_total_excess_deaths_per_100k_df['time_to_lockdown']
y = europe_total_excess_deaths_per_100k_df['total_excess_deaths_per_100k']
plt.scatter(x, y)
for i, txt in enumerate(europe_total_excess_deaths_per_100k_df['Country'
].to_list()):
plt.annotate(txt, (x.iloc[i], y.iloc[i]))
plt.xlabel("Time to lockdown (days)")
plt.ylabel("Total Excess Deaths per 100k")
plt.xticks(rotation='vertical')
plt.grid()
plt.title("Graph {} - Showing the Relationship Between Time Taken "
"to Implement Lockdown and Total Excess Deaths "
"per 100k of Population (Europe)".format(graphnum))
plt.show()
graphnum += 1
There is some relationship between the number of excess deaths and the time to lockdown, but it is not a clear or strong one. The five countries with the lowest numbers of excess deaths locked down at different times (over a period of 12 days), and some countries that locked down at the same time ended up with very different levels of excess deaths (compare Norway and Portugal, or Germany and Italy).
Despite this, there is some relationship. Of the five countries with the lowest excess fatalities, three locked down before DayNumber 0, while every country that locked down later than DayNumber 7 had a significantly higher level of excess fatalities.
This complexity and lack of clarity also suggests that there are other factors that influence the total number of excess deaths; it is a complicated interaction.
Knowing who has the virus (and could therefore spread it to others), and where the areas of greatest concentration are, could form the basis of a strategy such as quarantining infected people. It follows that the amount of testing carried out may have an indirect influence on the spread of the virus. This section will examine this relationship.
countries = weekly_excess_deaths_df.groupby(['country'])
# Set styles and colours using ColorBrewer palette
STYLES = ['solid', 'dashed']
NUM_STYLES = len(STYLES)
clrs = sns.color_palette('Set1')
NUM_COLOURS = len(clrs)
count = 0
plt.figure(figsize=(18, 9))
for c in list(countries.groups.keys()):
country = countries.get_group(c)
country_data = country[country['WeekNumber'] > -5]
plt.plot(country_data['WeekNumber'],
country_data['new_tests_smoothed_per_thousand'],
label=c,
color=clrs[count % NUM_COLOURS],
linestyle=STYLES[count % NUM_STYLES])
count += 1
plt.xlabel("WeekNumber")
plt.ylabel("Tests per 1000 of population")
plt.xticks(rotation='vertical')
plt.legend()
plt.grid()
plt.title("Graph {} - Graph to Compare Weekly Testing Rates "
"Across Countries".format(graphnum))
plt.show()
graphnum += 1
Although all countries in the graph implemented some level of testing, and this generally increased over time, both the amount of testing and when it started varied between countries. Each of these can be compared with the number of excess deaths to look for a relationship.
This section will examine the relationship between the time when a country begins its testing programme and the number of excess deaths. Plot a graph showing the excess deaths against the week number where testing began.
first_week_of_testing = []
countries = weekly_excess_deaths_df.groupby(['country'])
country_list = total_excess_deaths_per_100k_df['Country'].unique()
for c in country_list:
country = countries.get_group(c)
firstweek = country[(country['total_tests_per_thousand'] > 0)
| (country['new_tests_per_thousand'] > 0)][
'WeekNumber'].min()
first_week_of_testing.append(firstweek)
total_excess_deaths_per_100k_df[
'first_week_of_testing'] = first_week_of_testing
plt.figure(figsize=(18, 9))
x = total_excess_deaths_per_100k_df['first_week_of_testing']
y = total_excess_deaths_per_100k_df['total_excess_deaths_per_100k']
plt.scatter(x, y)
for i, txt in enumerate(total_excess_deaths_per_100k_df['Country'].to_list()):
plt.annotate(txt, (x.iloc[i], y.iloc[i]))
plt.xlabel("First week of testing (Week Number)")
plt.ylabel("Total Excess Deaths per 100k")
plt.xticks(rotation='vertical')
plt.grid()
plt.title("Graph {} - Graph to Show the Relationship Between Time Taken "
"to Start Testing and Total Excess Deaths per 100k of Population"
.format(graphnum))
plt.show()
graphnum += 1
There is possibly some indication of a relationship here, but not a very strong one. Next, perform the same comparison with only the European countries, to try to control for factors such as economics and culture:
plt.figure(figsize=(18, 9))
europe_total_excess_deaths_per_100k_df = total_excess_deaths_per_100k_df[
total_excess_deaths_per_100k_df['Country'].isin(europe)]
x = europe_total_excess_deaths_per_100k_df['first_week_of_testing']
y = europe_total_excess_deaths_per_100k_df['total_excess_deaths_per_100k']
plt.scatter(x, y)
for i, txt in enumerate(europe_total_excess_deaths_per_100k_df['Country'
].to_list()):
plt.annotate(txt, (x.iloc[i], y.iloc[i]))
plt.xlabel("First week of testing (Week Number)")
plt.ylabel("Total Excess Deaths per 100k")
plt.xticks(rotation='vertical')
plt.grid()
plt.title("Graph {} - Graph to Show the Relationship Between Time Taken"
" to Start Testing and Total Excess Deaths per 100k of Population"
" in Europe".format(graphnum))
plt.show()
graphnum += 1
All the countries that had the lowest levels of excess deaths started testing before Week 0 which does suggest early testing can help minimise the impact of the virus. However, it is still not a clear relationship, as shown by comparing Portugal and Denmark, or France and the Netherlands.
Next, look at the relationship between the total (cumulative) number of tests per thousand, per week of the crisis, and the total number of excess deaths. Note that, to allow comparison between countries, only data up to WeekNumber 25 is used, for which data is available for all countries in the excess deaths set. (This is the same approach as was applied to the data in section 4.1.2.2. above.)
mean_tests_per_thousand_per_week = []
modal_test_units = []
countries = weekly_excess_deaths_df.groupby(['country'])
country_list = total_excess_deaths_per_100k_df['Country'].unique()
for c in country_list:
country = countries.get_group(c)
# Data not available for all countries after latest_week,
# so cap both the test counts and the week count there
capped = country[country['WeekNumber'] <= latest_week]
tests = (capped['total_tests_per_thousand'].max()
/ capped['WeekNumber'].max())
# If 'total_tests_per_thousand' is all 0, fall back to summing
# 'new_tests_per_thousand' in case it is > 0
if tests == 0:
tests = (capped['new_tests_per_thousand'].sum()
/ capped['WeekNumber'].max())
test_units = country['tests_units'].mode()[0]
mean_tests_per_thousand_per_week.append(tests)
modal_test_units.append(test_units)
total_excess_deaths_per_100k_df['mean_tests_per_1000_per_week'
] = mean_tests_per_thousand_per_week
total_excess_deaths_per_100k_df['tests_units'] = modal_test_units
plt.figure(figsize=(18, 9))
x = total_excess_deaths_per_100k_df['mean_tests_per_1000_per_week']
y = total_excess_deaths_per_100k_df['total_excess_deaths_per_100k']
plt.scatter(x, y)
point_colours = ['red', 'brown', 'green', 'orange', 'blue', 'purple']
for i, c in enumerate(total_excess_deaths_per_100k_df['Country'].to_list()):
plt.annotate(c, (x.iloc[i], y.iloc[i]),
color=point_colours[int(total_excess_deaths_per_100k_df[
total_excess_deaths_per_100k_df['Country'] == c][
'tests_units'].values[0])])
plt.xlabel("Mean Tests per Thousand of Population per Week")
plt.ylabel("Total Excess Deaths per 100k")
plt.xticks(rotation='vertical')
plt.grid()
plt.text(10, 160, "Test Units:\n"
"Red: Units not clear\n"
"Brown: Tests conducted\n"
"Green: Tests conducted (incl. non-PCR)\n"
"Orange: Samples tested\n"
"Blue: People tested\n"
"Purple: People tested (incl. non-PCR)",
bbox={'facecolor': 'grey', 'alpha': 0.5, 'pad': 10})
plt.title("Graph {} - Graph to Show the Relationship Between "
"Mean Tests per 1000 of Population per Week "
"and Total Excess Deaths per 100k of Population".format(graphnum))
plt.show()
graphnum += 1
From this graph, it appears that there may be a relationship between the mean levels of testing and total excess deaths. Broadly, countries with a smaller amount of testing have higher excess deaths. This is not a strong relationship, though.
In addition, this may not be comparing like with like: countries measure the number of tests administered in different ways (represented in the graph above by the colour of the country name).
In addition, it may be useful to compare countries that are more similar in other ways (e.g. economically) - the following graph compares testing data only for the European countries:
plt.figure(figsize=(18, 9))
europe_total_excess_deaths_per_100k_df = total_excess_deaths_per_100k_df[
total_excess_deaths_per_100k_df['Country'].isin(europe)]
x = europe_total_excess_deaths_per_100k_df['mean_tests_per_1000_per_week']
y = europe_total_excess_deaths_per_100k_df['total_excess_deaths_per_100k']
plt.scatter(x, y)
point_colours = ['red', 'brown', 'green', 'orange', 'blue', 'purple']
for i, c in enumerate(europe_total_excess_deaths_per_100k_df['Country'
].to_list()):
plt.annotate(c, (x.iloc[i], y.iloc[i]),
color=point_colours[
int(europe_total_excess_deaths_per_100k_df[
europe_total_excess_deaths_per_100k_df[
'Country'] == c]['tests_units'].values[0])])
plt.xlabel("Mean Tests per Thousand of Population per Week")
plt.ylabel("Total Excess Deaths per 100k")
plt.xticks(rotation='vertical')
plt.grid()
plt.text(10, 80, "Test Units:\n"
"Red: Units not clear\n"
"Brown: Tests conducted\n"
"Green: Tests conducted (incl. non-PCR)\n"
"Orange: Samples tested\n"
"Blue: People tested\n"
"Purple: People tested (incl. non-PCR)",
bbox={'facecolor': 'grey', 'alpha': 0.5, 'pad': 10})
plt.title("Graph {} - Graph to Show the Relationship Between "
"Mean Tests per 1000 of Population per Week "
"and Total Excess Deaths per 100k of Population "
"(Europe)".format(graphnum))
plt.show()
graphnum += 1
Looking at the European countries, it's not clear that there is any relationship between average testing rates and excess deaths.
This section explored several aspects of the data to look for patterns and inferences. Data about the numbers of cases, fatalities, and excess fatalities shows that the fatality rate of the virus varies widely between different countries. Comparing the government responses to the spread of the virus over time shows that government lockdowns often precede a fall in cases but that the relationship between cases, fatalities and government measures is not a simple one. The level of excess fatalities in a country does have some relationship to both the time when lockdown took place and the time when testing started, with respect to DayNumber 1 of the virus, but it's clear these are not the only factors that are important - other factors also influence this. The picture that is building up is of a complex set of interacting factors all of which influence the spread of the virus and the fatality rate. The next sections will use modelling techniques to attempt to disentangle and understand these factors.
In this section unsupervised learning is applied to the data relating excess deaths to the government measures. The aim is to infer which government measures have the largest influence on achieving lower rates of excess deaths.
The approach will be as follows:
This follows a similar pattern to one of the other Udacity projects in my GitHub repository [19].
Choose the government measures to use in the prediction. Excluding some features is a good idea to reduce the complexity of the model.
The measures H4_Emergency investment in healthcare, H5_Investment in vaccines, E3_Fiscal measures and E4_International support are all expressed as one-off US dollar monetary amounts. This is very different from the ordinal and binary values used in the other features, so these will be excluded. This leaves the following government measures:
measures = [
'C1_School closing',
'C1_Flag',
'C2_Workplace closing',
'C2_Flag',
'C3_Cancel public events',
'C3_Flag',
'C4_Restrictions on gatherings',
'C4_Flag',
'C5_Close public transport',
'C5_Flag',
'C6_Stay at home requirements',
'C6_Flag',
'C7_Restrictions on internal movement',
'C7_Flag',
'C8_International travel controls',
'E1_Income support',
'E1_Flag',
'E2_Debt/contract relief',
'H1_Public information campaigns',
'H1_Flag',
'H2_Testing policy',
'H3_Contact tracing',
'H6_Facial Coverings',
'H6_Flag'
]
The analysis will focus on the measures implemented by governments near the time of lockdown. To do this, find the most common (modal) level of each measure from the week before lockdown to three weeks after it (lockdown being defined as the day the stringency index reached 70 - see section 4.3.).
countries = combined_daily_df.groupby(['country'])
country_list = list(total_excess_deaths_per_100k_df['Country'].unique())
lockdown_measures = np.zeros((len(country_list), len(measures)))
for i, c in enumerate(country_list):
country = countries.get_group(c)
lockdown_day = total_excess_deaths_per_100k_df[
total_excess_deaths_per_100k_df['Country'] == c]['time_to_lockdown'
].values[0]
lockdown_lower_limit = lockdown_day - 7
lockdown_upper_limit = lockdown_day + 21
lockdown_data_sample = country[(country['DayNumber']
> lockdown_lower_limit)
& (country['DayNumber']
<= lockdown_upper_limit)]
for j, m in enumerate(measures):
# mode() can return multiple modal values; take the first
lockdown_measures[i][j] = lockdown_data_sample[m].mode()[0]
lockdown_measures_df = pd.DataFrame(data=lockdown_measures, columns=measures)
lockdown_measures_df.insert(loc=0, column='Country', value=country_list)
lockdown_measures_df
It's clear that several features have the same value for all countries. These features can be dropped because they add nothing to the analysis.
for col in lockdown_measures_df.columns:
if len(lockdown_measures_df[col].unique()) == 1:
lockdown_measures_df.drop(columns=col, inplace=True)
lockdown_measures_df.shape
lockdown_measures_df now only contains features relevant to the clustering analysis.
Now merge lockdown_measures_df into total_excess_deaths_per_100k_df. This combines the lockdown measures we want to cluster with the data about excess deaths.
total_excess_deaths_per_100k_df = total_excess_deaths_per_100k_df.merge(
right=lockdown_measures_df, how='inner', on='Country')
total_excess_deaths_per_100k_df
We are building a list of features to cluster. The country and total_excess_deaths_per_100k features are not useful for the clustering analysis so will not be included. The level of testing does not seem to have much effect on the number of excess deaths (as seen in section 4.4.) so features related to testing will also be excluded. This allows the list of features to be defined in cluster_features:
cluster_features = list(total_excess_deaths_per_100k_df.columns)
cluster_features.remove('Country')
cluster_features.remove('total_excess_deaths_per_100k')
cluster_features.remove('mean_tests_per_1000_per_week')
cluster_features.remove('tests_units')
cluster_features
Next, define two subsets of the data, one including the countries with the highest excess death rates, and one including the countries with the lowest excess death rates (see section 4.1.2.2.).
high_xs_df = total_excess_deaths_per_100k_df[
(total_excess_deaths_per_100k_df['Country'] == 'Mexico') |
(total_excess_deaths_per_100k_df['Country'] == 'Spain') |
(total_excess_deaths_per_100k_df['Country'] == 'Peru') |
(total_excess_deaths_per_100k_df['Country'] == 'Ecuador') |
(total_excess_deaths_per_100k_df['Country'] == 'Belgium')][
cluster_features]
low_xs_df = total_excess_deaths_per_100k_df[
(total_excess_deaths_per_100k_df['Country'] == 'Germany') |
(total_excess_deaths_per_100k_df['Country'] == 'Denmark') |
(total_excess_deaths_per_100k_df['Country'] == 'Austria') |
(total_excess_deaths_per_100k_df['Country'] == 'Norway') |
(total_excess_deaths_per_100k_df['Country'] == 'Switzerland')][
cluster_features]
Confirm that the data contains no null values (which would break the clustering algorithm):
all_xs_df = total_excess_deaths_per_100k_df[cluster_features]
all_xs_df.isnull().sum()
Scale data with StandardScaler. Save the scaling model for later use on the high_xs and low_xs dataframes.
# Apply feature scaling
scaler = StandardScaler()
# Fit data, save scaling model
scaling_model = scaler.fit(all_xs_df)
# Transform the data
all_xs_df_scaled = scaling_model.transform(all_xs_df)
Use principal component analysis to reduce the number of dimensions in the dataset.
# Apply PCA to the data.
pca = PCA(random_state=11)
pca_model = pca.fit(all_xs_df_scaled)
Examine the variance represented by each principal component. The aim is to identify the most important ones, retain those, and drop the others from the analysis to produce a simpler model.
# Calculate variance values and a cumulative total
num_components = len(pca_model.explained_variance_ratio_)
ind = np.arange(num_components)
vals = (pca_model.explained_variance_ratio_) * 100
cumvals = np.cumsum(vals)
# Plot graphs of variance value and cumulative total
fig, ax = plt.subplots(2, 1, figsize=[18, 20])
ax[0].set_title("Graph {} - Variance Explained by Each Principal Component"
.format(graphnum))
ax[0].bar(ind, vals)
ax[0].xaxis.set_tick_params(width=0)
ax[0].yaxis.set_tick_params(width=2, length=12)
ax[0].set_xlabel("Principal Component")
ax[0].set_ylabel("Variance Explained (%)")
ax[0].xaxis.set_ticks(np.arange(0, 18, 1))
graphnum += 1
ax[1].set_title("Graph {} - Cumulative Variance Explained by Each Principal "
"Component".format(graphnum))
ax[1].plot(ind, cumvals)
ax[1].xaxis.set_tick_params(width=0)
ax[1].yaxis.set_tick_params(width=2, length=12)
ax[1].grid()
ax[1].set_xlabel("Principal Component")
ax[1].set_ylabel("Variance Explained (%)")
ax[1].xaxis.set_ticks(np.arange(0, 18, 1))
ax[1].yaxis.set_ticks(np.arange(0, 100, 10))
graphnum += 1
plt.show()
Nine principal components (0-8) account for more than 85% of the variance. Re-fit the PCA model with 9 principal components and transform the full data set into the principal components. Retain the model for future use on the low_xs and high_xs data sets.
# Re-apply PCA to the data for 9 principal components.
pca = PCA(n_components=9, random_state=20)
# Retain PCA model for later use on the low_xs and high_xs data.
pca_model = pca.fit(all_xs_df_scaled)
all_xs_pca = pca_model.transform(all_xs_df_scaled)
Each principal component (labelled from 0 to 8) consists of a weighted sum of the government measures we're interested in. We now interpret the principal components in terms of those government measures.
A principal component (PC) will itself be a weighted sum of the features in the data set. In general, a principal component will have some features which are positive and some which are negative.
A principal component will increase in the positive direction (or decrease in the negative direction) with an increase in the absolute value of the positive features. The reverse is true for the negative features, i.e. a principal component will decrease in the positive direction (or increase in the negative direction) with an increase in the absolute value of the negative features.
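To make this concrete, here is a minimal sketch on toy data (illustrative only, not the project's data) showing that a sample's score on a principal component is simply the dot product of its centred feature values with that component's weights:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (illustrative only): two strongly correlated features
rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.1, size=100)])

pca = PCA(n_components=1).fit(X)
weights = pca.components_[0]  # PC0 as a weighted sum of the two features

# A sample's PC score is the dot product of its centred features
# with the component weights - identical to what transform() returns
score = (X[0] - pca.mean_) @ weights
print(np.isclose(score, pca.transform(X[:1])[0, 0]))  # True
```

Increasing a positively-weighted feature pushes the score in the positive direction; increasing a negatively-weighted feature pushes it the other way.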
Each of the PCs will be interpreted accordingly.
def interpret_pc(pca_model, pca_num):
"""Print principal components including top 3 positive and negative.
param pca_model: fitted PCA model
param pca_num: principal component number to interpret
"""
pc_to_feat = pd.DataFrame(pca_model.components_, columns=all_xs_df.columns)
pc = pc_to_feat.iloc[pca_num]
print("All component weights for Principal Component {}, "
"sorted by weight:".format(pca_num))
print(pc.sort_values(ascending=False))
print("")
print("3 largest positive-weighted features in Principal Component {0}:"
.format(pca_num))
print(pc.sort_values(ascending=False).iloc[0:3])
print("")
print("3 largest negative-weighted features in Principal Component {0}:"
.format(pca_num))
print(pc.sort_values(ascending=False).iloc[-3:])
interpret_pc(pca_model, 0)
interpret_pc(pca_model, 1)
interpret_pc(pca_model, 2)
interpret_pc(pca_model, 3)
interpret_pc(pca_model, 4)
interpret_pc(pca_model, 5)
interpret_pc(pca_model, 6)
interpret_pc(pca_model, 7)
interpret_pc(pca_model, 8)
Now use K-means clustering on the data. We don't initially know how many clusters to choose, so create a number of models with increasing numbers of clusters and score them (a lower score indicates a better fit).
MAX_CLUSTERS = 17
kmeans = []
models = []
scores = []
for clusters in range(MAX_CLUSTERS):
kmeans.append(KMeans(clusters+1, random_state=42))
models.append(kmeans[clusters].fit(all_xs_pca))
scores.append(np.abs(models[clusters].score(all_xs_pca)))
print("Score for K = {0} is {1}".format(clusters+1, scores[clusters]))
Plot the variation in score with the number of centres (K). We are looking for the "elbow point", where the rate of decrease in score noticeably flattens.
centres = list(range(1, MAX_CLUSTERS+1))
plt.figure(figsize=(18, 9))
plt.plot(centres, scores)
plt.title("Graph {} - Variation in Score with Number of Centres (K)"
.format(graphnum))
plt.xlabel("Number of Centres (=K)")
plt.ylabel("Score")
plt.xticks(centres)
plt.show()
graphnum += 1
After trying several values for K, set K=6. Use the model trained above with K=6 to calculate the cluster labels for each country.
CLUSTERS = 6
kmeans_model = models[CLUSTERS-1]
all_xs_kmeans_labels = kmeans_model.predict(all_xs_pca)
Now apply the scaling and PCA transform steps (using the same models trained on all the data above) to the high and low excess deaths groups. Then use the K-means model trained above to calculate their cluster labels.
low_xs_scaled = scaling_model.transform(low_xs_df)
low_xs_pca = pca_model.transform(low_xs_scaled)
low_xs_kmeans_labels = kmeans_model.predict(low_xs_pca)
high_xs_scaled = scaling_model.transform(high_xs_df)
high_xs_pca = pca_model.transform(high_xs_scaled)
high_xs_kmeans_labels = kmeans_model.predict(high_xs_pca)
Now compare the proportions of each data set (all_xs_pca, low_xs_pca and high_xs_pca) represented in each cluster.
# Calculate the proportion of the data set made up by each cluster
low_xs_label_count = np.bincount(low_xs_kmeans_labels, minlength=CLUSTERS)
high_xs_label_count = np.bincount(high_xs_kmeans_labels, minlength=CLUSTERS)
all_xs_label_count = np.bincount(all_xs_kmeans_labels, minlength=CLUSTERS)
low_xs_proportion = low_xs_label_count / len(low_xs_kmeans_labels)
high_xs_proportion = high_xs_label_count / len(high_xs_kmeans_labels)
all_xs_proportion = all_xs_label_count / len(all_xs_kmeans_labels)
k = np.arange(0, CLUSTERS)
# Plot a chart to show the proportion of each data set
# represented by each cluster
fig, ax = plt.subplots(figsize=[18, 9])
width = 0.25
p1 = ax.bar(k, all_xs_proportion * 100, width, color='b')
p2 = ax.bar(k+width, low_xs_proportion * 100, width, color='g')
p3 = ax.bar(k+2*width, high_xs_proportion * 100, width, color='r')
ax.set_title("Graph {} - Proportion of Data Set Represented by Each Cluster "
"for all_xs, low_xs and high_xs Data Sets".format(graphnum))
ax.set_xlabel("Cluster Number (K)")
ax.set_ylabel("Proportion (%)")
ax.set_xticks(k)
ax.legend((p1[0], p2[0], p3[0]), ('all_xs', 'low_xs', 'high_xs'))
ax.autoscale_view()
plt.show()
graphnum += 1
When looking at this graph it's important to remember that the low_xs and high_xs groups contain only 5 countries each. This means that while a representation of 20% of the group might look like a lot, it only represents one country. all_xs is bigger (21 countries), but around 5% of that group still represents one country. To illustrate: the low_xs group looks very over-represented in cluster 4, but in reality there is just one country in that cluster, so we cannot draw any conclusion from this.
Knowing this, we can be appropriately careful about interpreting the results. Even with appropriate care, it's important not to draw very firm conclusions from such a small data set.
Nevertheless, the graph shows that the low_xs group is over-represented in cluster 2, with 4 countries, compared to 1 country from the high_xs group and one additional country from neither (6 countries total in this cluster.) low_xs is also under-represented in cluster 1, which contains no low_xs countries, and 6 countries in total, 2 of which are from the high_xs group. So these two clusters are worth looking at more closely.
(The cluster labels can vary with repeated runs of the clustering algorithm - this is normal for K-means clustering because there is an element of randomness in the algorithm. However, the interpretation of the components remains the same.)
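One way to confirm that two runs describe the same partition despite permuted labels is a permutation-invariant score such as the adjusted Rand index. A minimal sketch on toy data (illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Toy data (illustrative only): three well-separated blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(30, 2))
               for c in (0, 5, 10)])

# Two runs with different seeds may number the clusters differently...
labels_a = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X)

# ...but the adjusted Rand index is invariant to label permutation,
# so identical partitions score 1.0 regardless of the numbering
print(adjusted_rand_score(labels_a, labels_b))
```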
This is a small data set so it is important not to read too much into the results. Therefore, we will focus on the single most important principal component represented by each cluster, and the top/bottom two features that contribute to that component.
def interpret_cluster(kmeans_model, cluster):
"""Print the most significant principal component in a cluster.
param kmeans_model: fitted K-means model
param cluster: cluster number
"""
cluster_pcs = pd.DataFrame(kmeans_model.cluster_centers_[cluster])
print("Most Significant Principal Component (by absolute value) "
"in Cluster {0}:".format(cluster))
print(cluster_pcs.reindex(
cluster_pcs.abs().sort_values(0, ascending=False).index).iloc[0])
# low_xs_df is over represented in cluster 2
# Print details of principal components:
print(kmeans_model.cluster_centers_[2])
print("")
interpret_cluster(kmeans_model, 2)
PC3 negative:
Countries in this cluster have relatively high scores for:
C4_Restrictions on gatherings
E1_Income support
Countries in the low_xs group tend to implement C4 and E1 relatively strongly compared to those in the high_xs group.
Countries in this cluster have relatively low scores for:
C5_Close public transport
C5_Flag
Countries in the high_xs group tend to implement C5 relatively strongly compared to those in the low_xs group.
Countries in the low_xs group are over-represented in cluster 2. Countries in the high_xs group are averagely represented in this cluster.
# low_xs_df is under represented in cluster 1.
# Print details of principal components:
print(kmeans_model.cluster_centers_[1])
print("")
interpret_cluster(kmeans_model, 1)
PC1 negative:
Countries in this cluster have high values for:
C6_Stay at home requirements
H3_Contact tracing
And low scores for:
time_to_lockdown
first_week_of_testing
This means that although countries in this cluster tend to have implemented stay at home requirements and contact tracing relatively strongly, they also have a longer time_to_lockdown and a later first_week_of_testing.
Countries in the low_xs group are under-represented in this cluster, while countries in the high_xs group have average representation.
In drawing conclusions from the model above it is important to keep in mind that the set of countries for which excess deaths data is available is small. For this reason it would be wrong to draw very firm conclusions. Nevertheless, it can be used to suggest some tentative conclusions, subject to further investigation.
The model tentatively suggests that the following measures are more strongly associated with low excess deaths:
C4_Restrictions on gatherings
E1_Income support
compared to the measures:
C5_Close public transport
C5_Flag
which are less associated with low excess deaths.
The model also suggests that low values for the following measures are more strongly associated with low excess deaths:
time_to_lockdown (a low value means earlier lockdown)
first_week_of_testing (a low value means starting testing sooner)
compared to the measures:
C6_Stay at home requirements
H3_Contact tracing
Restricting gatherings would be expected to reduce transmission of the virus through person-to-person contact. Income support allows people to stop going to work, which minimises another potential transmission source.
Income support might also indirectly mean that richer countries fare better in the pandemic. This probably follows, because if a government can afford to support people financially then it is also more likely to be able to provide a good standard of healthcare. This illustrates a very important point applicable to all the conclusions here: correlation between factors does not prove causation. Income support is correlated with low excess deaths, but the model cannot tell us whether income support is itself the cause of lower excess deaths, or whether other factors related to wealthy countries are more important.
It's logical to expect that stay at home requirements and contact tracing would reduce transmission and deaths. Nevertheless, this model suggests that the speed of the response (a quicker lockdown, and a quicker start to the testing programme) is a more important factor in reducing the number of excess deaths overall.
It is important to note that although some measures may be identified as more strongly associated with minimising deaths, this is relative to other measures. It does not mean that measures that are not strongly associated with a lower level of excess deaths are pointless - indeed, those measures may save very many lives. It just means that some measures are more effective than others.
The aim of this section is to create a prediction model for the time evolution of the number of cases in each country. The main input features will be the government measures taken. The model's performance as a predictor is one factor of interest, but the importance of the features in the model will also be analysed. The intention is to use the feature importances to infer which government measures are most important in controlling the spread of the virus, if possible.
The approach will be to build one prediction model for all countries, as opposed to building a separate model for each country. The advantage is that the single model can be trained with many more data points.
The target value for the prediction model is the number of confirmed cases reported by each country each day (7DANewCases). This model will use the well-known method of using lag and difference features to predict a time series (see for example [16]). Adding more lag features tends to improve accuracy but also increases complexity. After several experiments, seven lag features based on 7DANewCases were found to give good performance. Lag features are denoted by a postfix for the number of days of lag. Specifically:
7DANewCases-1: 7DANewCases on (day-1)
7DANewCases-2: 7DANewCases on (day-2)
7DANewCases-7: 7DANewCases on (day-7)
The same convention applies to the other lag features.
7DADeltaNewCases is already calculated in the data in previous sections and this can be used to create lagged difference features:
7DADeltaNewCases-1 = (7DANewCases-1) - (7DANewCases-2)
7DADeltaNewCases-2 = (7DANewCases-2) - (7DANewCases-3)
7DADeltaNewCases-7 = (7DANewCases-7) - (7DANewCases-8)
The model will also use government measures as input values. It is useful to understand some essential features of the virus in order to find the best way to incorporate these.
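The lag and difference features described above can be built with pandas `shift`. A minimal sketch on toy data (illustrative only; in the full data frame these must be computed per country, e.g. via `groupby('country')`):

```python
import pandas as pd

# Toy series (illustrative only): 7DANewCases for a single country
df = pd.DataFrame({'7DANewCases': [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]})

# Lag features: the value n days earlier
for n in (1, 2):
    df['7DANewCases-{}'.format(n)] = df['7DANewCases'].shift(n)

# Lagged difference feature, matching the naming convention above
df['7DADeltaNewCases-1'] = df['7DANewCases-1'] - df['7DANewCases-2']
print(df['7DADeltaNewCases-1'].tolist())  # [nan, nan, 2.0, 3.0, 4.0, 5.0]
```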
When a patient is infected with COVID-19, the virus seems to follow the following (approximate) course:
Severe cases may require longer hospitalisation, perhaps several weeks.
This is a very simple, generalised model, and there is variation in the way the virus progresses in individual patients. Nevertheless, we will use this approximation to guide the construction of the model. It follows that the number of new cases today depends partly on the government measures in place 5 to 14 days ago. One way to create the model would be to calculate 10 lag features for each government response, like this:
C1_School closing-5: C1_School closing on day-5
C1_School closing-6: C1_School closing on day-6
C1_School closing-14: C1_School closing on day-14
The disadvantage is that this would lead to a model with hundreds of features, and the complexity this introduces may present problems with training and interpreting the model. Instead, calculate these lag features and take their mean value over the period 5 to 14 days before the prediction, so that each government measure generates just one feature.
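One way to collapse the 10-day lag window into a single feature is `shift` followed by a rolling mean. A minimal sketch on a toy series (illustrative only; in the full data this would be applied per country):

```python
import pandas as pd

# Toy series (illustrative only): one ordinal measure over 20 days
s = pd.Series([0] * 5 + [1] * 5 + [2] * 10, name='C1_School closing')

# Mean level of the measure over days -14 to -5 relative to each day:
# shift(5) aligns the day-5 value with "today", then a 10-day rolling
# mean spans the shifted values for days -5 back to -14
mean_5_to_14 = s.shift(5).rolling(window=10).mean()
print(mean_5_to_14.iloc[-1])  # mean of s[5:15] -> 1.5
```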
The number of new cases must also depend partly on the number of infectious cases in the community 5 to 14 days ago. This model does not attempt to directly estimate the number of infectious cases in the community, but the lag features based on 7DANewCases described above provide an indirect indication of this.
With this approach, we can now construct a model.
First prepare the data to be used in the prediction model.
Begin by dropping the most recent week of data. This is because new data arrives daily but is often revised and updated in the following days, so the most recent week is often unreliable.
combined_daily_df = combined_daily_df[combined_daily_df['date'] <
(combined_daily_df['date'].max()) -
timedelta(7)].copy()
Determine the last date in the data series. This will be used in several places in this section.
LAST_DATE = combined_daily_df['date'].max()
It is possible that some locations have not yet had a major outbreak of the virus and these will skew both training and testing. Therefore, exclude countries which have not been significantly affected by the virus to date. Any countries that have not reached Day 60 of the virus will be dropped.
virus_free = []
country_groups = combined_daily_df.groupby(['country'])
for c in list(country_groups.groups.keys()):
country_group = country_groups.get_group(c)
if country_group[country_group['date'] <=
LAST_DATE]['DayNumber'].max() < 60:
virus_free.append(c)
print("Countries dropped from analysis because "
"they have been minimally impacted: {}".format(virus_free))
combined_daily_df.drop(combined_daily_df.loc[
combined_daily_df['country'].isin(virus_free)].index, inplace=True)
The country name should be treated as a categorical variable in the prediction model. To make this explicit to the model, encode country as a categorical variable now:
# Encode country field
labenc = LabelEncoder()
combined_daily_df['country_num'] = labenc.fit_transform(
combined_daily_df['country'])
combined_daily_df['country_cat'] = combined_daily_df[
'country_num'].astype('category')
combined_daily_df.drop(columns='country_num', inplace=True)
combined_daily_df.head()
# Just for debugging convenience. These lines are useful during development.
# Uncomment this line to save a copy of the source data frame before processing
# combined_daily_df_saved = combined_daily_df.copy()
# Uncomment this line to reload the data frame saved in the line above
# combined_daily_df = combined_daily_df_saved.copy()
Choose the government measures to use in the prediction. Excluding some features reduces the complexity of the model.
Since H4_Emergency investment in healthcare and H5_Investment in vaccines are mostly focussed on treating the disease rather than stopping its spread, these will be excluded from the analysis.
E3_Fiscal measures and E4_International support are both expressed as one-off US dollar monetary amounts. This is very different from the ordinal and binary scales used for the other features, so these will also be excluded.
gov_measures = [
'C1_School closing',
'C1_Flag',
'C2_Workplace closing',
'C2_Flag',
'C3_Cancel public events',
'C3_Flag',
'C4_Restrictions on gatherings',
'C4_Flag',
'C5_Close public transport',
'C5_Flag',
'C6_Stay at home requirements',
'C6_Flag',
'C7_Restrictions on internal movement',
'C7_Flag',
'C8_International travel controls',
'E1_Income support',
'E1_Flag',
'E2_Debt/contract relief',
'H1_Public information campaigns',
'H1_Flag',
'H2_Testing policy',
'H3_Contact tracing',
'H6_Facial Coverings',
'H6_Flag'
]
As described in section 6.1. Approach, the model will use lag and difference features for the government measures and number of new cases. These will be constructed here.
def make_lag_features(df, base_features, min_lag, max_lag):
"""Create lag features in a data frame.
param df: data frame
param base_features: existing features to create lag features from
param min_lag: smallest lag
param max_lag: exclusive upper bound; the largest lag created is max_lag - 1
"""
country_groups = df.groupby(['country'])
for c in list(country_groups.groups.keys()):
country_group = country_groups.get_group(c)
for i in range(min_lag, max_lag):
for f in base_features:
lagfeat = '{0}-{1}'.format(f, str(i))
df.loc[(df['country'] == c),
lagfeat] = country_group[f].shift(i)
# Replace null values resulting from using shift() with 0
for i in range(min_lag, max_lag):
for f in base_features:
lagfeat = '{0}-{1}'.format(f, str(i))
df[lagfeat].fillna(value=0, inplace=True)
# Define the number of lag features.
# The following two values apply to the gov. measures
MAX_LAG_GOV_MEASURES = 15
MIN_LAG_GOV_MEASURES = 5
# The following two values apply to 7DANewCases and 7DADeltaNewCases
MAX_LAG_CASES = 8
MIN_LAG_CASES = 1
# Lag features for government measures
make_lag_features(combined_daily_df, gov_measures,
MIN_LAG_GOV_MEASURES, MAX_LAG_GOV_MEASURES)
# Lag features for new cases
make_lag_features(combined_daily_df, ['7DANewCases', '7DADeltaNewCases'],
MIN_LAG_CASES, MAX_LAG_CASES)
combined_daily_df.head()
As described in section 6.1. Approach, the model will use mean values of the lag features for the government measures to minimise complexity. Calculate these mean values now.
def find_lag_features(features, max_lag, min_lag):
""" Create a list of lag feature names from their base features.
param features: list of base features not including lag features
param max_lag: exclusive upper bound; the largest lag named is max_lag - 1
param min_lag: smallest lag
return lag_features: list of lag feature names not including base features
"""
lag_features = ['{0}-{1}'.format(f, str(i)) for i in range(
min_lag, max_lag) for f in features]
return lag_features
for measure in gov_measures:
meanfeat = 'mean-{0}'.format(measure)
combined_daily_df[meanfeat] = combined_daily_df[
find_lag_features([measure],
MAX_LAG_GOV_MEASURES,
MIN_LAG_GOV_MEASURES)].mean(axis=1)
def find_mean_features(features):
""" Create a list of mean feature names from their base features.
param features: list of base features not including lag features
return mean_features: list of mean feature names
"""
mean_features = ['mean-{0}'.format(f) for f in features]
return mean_features
Define the complete list of features (including lag and difference features) to be used in the model.
features = (['country',
'country_cat',
'date',
'DayNumber',
'7DANewCases',
'7DADeltaNewCases']
+ find_lag_features(['7DANewCases', '7DADeltaNewCases'],
MAX_LAG_CASES, MIN_LAG_CASES)
+ find_mean_features(gov_measures))
features
Data will be split into train and test sets based on country and date.
Randomly split the countries into two groups:

- Training countries (around three quarters of countries): all of their data will be used in training.
- Test countries (around one quarter): data up to DayNumber 30 for these countries will be used in training, but the rest will be excluded from training and used only in testing.

Since we have already excluded countries where the virus has not reached DayNumber 60, we know that for each of the testing countries, there is at least 30 days of testing data available. This gives a good amount of testing data.
country_list = list(combined_daily_df['country'].unique())
num_countries = len(country_list)
test_countries = random.sample(country_list, k=round(num_countries/4))
test_countries
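Note that random.sample is not seeded here, so the test-country split (and therefore the reported results) will differ between runs. A minimal sketch of making the split reproducible (the seed value 0 is arbitrary):

```python
import random

# Toy country list standing in for country_list (names illustrative)
countries = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']

# Resetting the seed before sampling makes the draw deterministic
random.seed(0)
split1 = random.sample(countries, k=round(len(countries) / 4))

random.seed(0)
split2 = random.sample(countries, k=round(len(countries) / 4))

assert split1 == split2  # identical because the seed was reset
```

Seeding just before the call to random.sample would make the train/test split, and hence the evaluation below, repeatable between notebook runs.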
Create two separate dataframes for training and testing. combined_daily_df_test holds all data for the test countries, while combined_daily_df_train drops the test countries' rows after DayNumber 30.
combined_daily_df_test = combined_daily_df.drop(
combined_daily_df.loc[~combined_daily_df['country'].isin(test_countries)]
.index)
combined_daily_df_train = combined_daily_df.drop(combined_daily_df.loc[
(combined_daily_df['country'].isin(test_countries))
& (combined_daily_df['DayNumber'] > 30)].index)
# Create a feature True7DANewCases to hold the ground truth values.
# This is because 7DANewCases will be updated dynamically when long range
# predictions are made.
combined_daily_df_test['True7DANewCases'] = \
combined_daily_df_test['7DANewCases']
Light GBM Regressor (LGBMRegressor) is used in this model. Several other models were tried, including XGBRegressor, GradientBoostingRegressor and RandomForestRegressor. In my experiments, LGBMRegressor gives the best performance.
Data is scaled with MinMaxScaler. In my experiments, this gives a better performance than StandardScaler.
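For context on that choice, the two scalers treat outliers quite differently. A minimal sketch on toy values (illustrative only, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One skewed column with a large outlier (values made up for illustration)
X = np.array([[0.0], [1.0], [2.0], [100.0]])

mm = MinMaxScaler().fit_transform(X)    # maps the column into [0, 1]
ss = StandardScaler().fit_transform(X)  # zero mean, unit variance

# MinMaxScaler pins the extremes to exactly 0 and 1; the outlier squashes
# the remaining values into a narrow band near 0
assert mm.min() == 0.0 and mm.max() == 1.0

# StandardScaler centres the column, so its mean is (numerically) zero
assert abs(ss.mean()) < 1e-9
```

Which behaviour works better depends on the data, hence the empirical comparison described above.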
These are combined in a pipeline.
Grid search with cross validation is used to find the optimal parameters. Each training pass attempts to predict the new daily cases (7 day average) for one country/date combination in the training set. GridSearchCV handles the validation testing automatically by reserving a validation set for each of the folds defined in parameter cv.
Several cross validation scorers were tried (including root mean squared error and root mean squared log error). Cross validation scoring based on mean absolute error gives the best results.
mae_score = make_scorer(mean_absolute_error, greater_is_better=False)
def train_model(parameters, X_train, y_train):
""" Train the model.
param parameters: Model parameters to optimise.
param X_train: Input features for the model.
param y_train: Labels (ground-truth 7DANewCases) for training the model
return lgbmcv: Trained model.
"""
pipeline = Pipeline([
('scaler', MinMaxScaler()),
('mdl', LGBMRegressor())
])
lgbmcv = GridSearchCV(pipeline,
cv=5,
scoring=mae_score,
param_grid=parameters,
verbose=2)
lgbmcv.fit(X_train, y_train)
return lgbmcv
Define the parameters to optimise during training. Three possible sets of parameters are given below; only the uncommented set is used. The default (uncommented) set searches a moderate range of parameters. To reduce training time while still getting reasonable performance, comment out the default parameters and uncomment the set at the top of the following cell. To search a wider range of parameters, comment out the default parameters and uncomment the set at the bottom of the cell instead.
# The following parameters give reasonable results and are quickest to train.
"""
parameters = {
'mdl__n_estimators': [12000],
'mdl__learning_rate': [0.06],
'mdl__reg_lambda': [0.001],
'mdl__reg_alpha': [0.1],
'mdl__max_depth': [-1],
'mdl__num_leaves': [8],
'mdl__max_bin': [700],
'mdl__boosting_type': ['dart'],
'mdl__drop_rate': [0.1],
'mdl__skip_drop': [0.5],
'mdl__reg_sqrt': [True]
}
"""
# Default parameter set.
# Does some parameter searching without taking an excessively long time.
parameters = {
'mdl__n_estimators': [12000],
'mdl__learning_rate': [0.05, 0.06, 0.07],
'mdl__reg_lambda': [0.001],
'mdl__reg_alpha': [0.1],
'mdl__max_depth': [-1],
'mdl__num_leaves': [6, 8],
'mdl__max_bin': [700],
'mdl__boosting_type': ['dart'],
'mdl__drop_rate': [0.1],
'mdl__skip_drop': [0.5],
'mdl__reg_sqrt': [True],
'mdl__bagging_fraction': [0.8],
'mdl__bagging_freq': [10]
}
# The following gives a wide range of training values but takes a long time.
"""
parameters = {
'mdl__n_estimators': [10000, 12500, 15000],
'mdl__learning_rate': [0.05, 0.06, 0.07],
'mdl__reg_lambda': [0.001],
'mdl__reg_alpha': [0.1],
'mdl__max_depth': [-1, 6, 8, 10],
'mdl__num_leaves': [6, 8, 10],
'mdl__max_bin': [600, 800],
'mdl__boosting_type': ['dart'],
'mdl__drop_rate': [0.1],
'mdl__skip_drop': [0.5],
'mdl__reg_sqrt': [True]
}
"""
The following features are needed in combined_daily_df_train but must be excluded from training.
exclude_features = [
'date',
'country',
'DayNumber',
'7DANewCases',
'7DADeltaNewCases']
Although lag and difference features are used, the way the dataframes are constructed means that every data row in the training and testing set is self-contained: all the lag and difference features needed to train the model or make a prediction are included in the row. This makes it a standard supervised learning problem, with a set of input features and a known output value. It also means the training rows can be shuffled, because their order does not matter.
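sklearn.utils.shuffle permutes X and y in unison, so each feature row stays aligned with its label. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.utils import shuffle

# Toy data where row i is [2i, 2i+1] and its label is i,
# so alignment is easy to verify after shuffling
X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 2, 3, 4])

Xs, ys = shuffle(X, y, random_state=0)

# Rows and labels remain paired after the shuffle
for row, label in zip(Xs, ys):
    assert row[0] == 2 * label
```

This is why shuffling the self-contained rows is safe here: the permutation is applied identically to inputs and targets.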
X_train = combined_daily_df_train[features].copy().drop(
columns=exclude_features).values
y_train = combined_daily_df_train['7DANewCases'].values
X_train_shuf, y_train_shuf = shuffle(X_train, y_train)
lgbmcv = train_model(parameters, X_train_shuf, y_train_shuf)
lgbmcv.best_params_
In order to evaluate the performance of the prediction model, it's necessary to measure the difference between predictions on the test set and the true values. This will be done with three different scoring methods: the root mean square error (RMSE), root mean square log error (RMSLE) and mean absolute error (MAE).
It's also necessary to create a "baseline" model, against which the performance of the regression model can be judged. The "persistence" model will be used for this purpose, which simply predicts that new cases for the next day are the same as new cases for the previous day. If the regression model beats this, then it is making useful predictions.
def rmse(ytrue, ypred):
return np.sqrt(mean_squared_error(ytrue, ypred))
def rmsle(ytrue, ypred):
# Treat any negative predictions as 0, using a copy so the
# caller's array is not modified in place
ypred = np.clip(ypred, 0, None)
return np.sqrt(mean_squared_log_error(ytrue, ypred))
def persistence_model(y):
"""Predict using the persistence model and calculate errors.
Calculates predictions and associated errors using the persistence model.
Predictions are made from the second day provided in y
onwards. For example, if y is provided from Day 0 then the first
prediction will be made on Day 1. This is for two reasons. First, the
shift() operation means the first "prediction" will always be a 0, i.e.
invalid. Second, and more importantly, it is consistent with the
calculation in regression_model().
param: y - true values for one country
return: rmse_error
return: rmsle_error
return: mae_error
"""
# Ignore the Day 0 prediction - shift() means it is invalid
preds = y['True7DANewCases'].shift(1)[1:].values
trues = y['True7DANewCases'][1:].values
rmse_error = rmse(trues, preds)
rmsle_error = rmsle(trues, preds)
mae_error = mean_absolute_error(trues, preds)
print("RMSE for persistence model = {:.3f}".format(rmse_error))
print("RMSLE for persistence model = {:.5f}".format(rmsle_error))
print("MAE for persistence model = {:.3f}".format(mae_error))
return(rmse_error, rmsle_error, mae_error)
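As a quick sanity check of the persistence idea on a made-up series (values are illustrative only):

```python
import numpy as np

# Toy "true new cases" series; values chosen arbitrarily
y_true = np.array([10.0, 12.0, 11.0, 15.0, 14.0])

# Persistence: the prediction for day t is the true value at day t-1,
# so predictions exist from the second day onwards
y_pred = y_true[:-1]
trues = y_true[1:]

# Errors |2|, |1|, |4|, |1| give a mean absolute error of 2.0
mae = np.mean(np.abs(trues - y_pred))
print(mae)  # → 2.0
```

Any model whose MAE beats this naive baseline is extracting genuine signal beyond "tomorrow looks like today".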
After training the model it can be used to make predictions. The prediction code is written so predictions can be made in one of two modes:
- Short range: predict just one day ahead, using the true values of all input features.
- Long range: predict many days ahead, feeding the model's own predictions back in as 7DANewCases. However, input features like 7DANewCases-1, 7DADeltaNewCases-1, and so on are based on the output of the model. So as the model predicts further and further ahead, it has to use its own predictions as input values. This makes long range prediction much more difficult.

def regression_model(X, y):
"""Predict using the regression model and calculate errors.
Calculates predictions and associated errors using the trained prediction
model. Predictions are made from the second day provided in X and y
onwards. For example, if X and y are provided from Day 0 then the first
prediction will be made on Day 1. This is for two reasons. First, the
update_long_range_model() step requires known values on the previous day
to work. Second, and more importantly, it is consistent with the
calculation in persistence_model().
param X: Inputs to use to predict new cases. Must be from Day 0 on.
param y: Ground truth values for new cases. Must be from Day 0 on.
return preds: Predictions from the regression model, Day 1 onward
return rmse_error: RMSE error
return rmsle_error: RMSLE error
return mae_error: MAE error
"""
predlist = []
num_days = (X['date'].max() - X['date'].min()).days
# Predict from the second value
for day in range(1, num_days+1):
prediction_date = X['date'].min() + timedelta(day)
X_pred = X[X['date'] ==
prediction_date].drop(columns=exclude_features).values
y_pred = lgbmcv.predict(X_pred)
# New cases can never be <0. Treat predictions <0 as 0.
y_pred[y_pred < 0] = 0
# Record predicted new cases
predlist.append(y_pred)
if LONG_RANGE_MODEL:
update_long_range_model(X, y_pred, prediction_date)
preds = np.array(predlist).reshape(len(predlist),)
# Predictions are made from the second date provided in the inputs.
trues = y['True7DANewCases'][1:].values
# Calculate errors
rmsle_error = rmsle(trues, preds)
rmse_error = rmse(trues, preds)
mae_error = mean_absolute_error(trues, preds)
print("RMSE for regression model = {:.3f}".format(rmse_error))
print("RMSLE for regression model = {:.5f}".format(rmsle_error))
print("MAE for regression model = {:.3f}".format(mae_error))
return(preds, rmse_error, rmsle_error, mae_error)
Now perform predictions on the countries reserved for testing and compare to the persistence model.
Note this simple model only makes a prediction one day ahead. At any time step, it is using true (i.e not predicted) data values from the previous days. This makes it straightforward to compare to the persistence model, which also predicts only one day ahead, and uses data from previous days.
Also note that this model only attempts to make predictions after "Day 1" of the virus (recall this is defined as the first day when 5 or more fatalities due to the virus were recorded).
Now test the model predictions to see how it performs on the test data, compared to the persistence model.
The following functions support the testing process.
def test_model(df_test, country, first_prediction_daynumber):
"""Test regression model and compare performance to persistence model.
Predictions are made from first_prediction_daynumber onwards but the
regression and persistence models both require data starting the day before
first_prediction_daynumber. For example, if first_prediction_daynumber is
1 then this function must provide data to persistence_model() and
regression_model() starting from the day before (Day 0).
param df_test: Dataframe containing test data.
param country: Country to make test predictions on.
param first_prediction_daynumber: First day to make a prediction.
return regression_preds: Predictions from the regression model.
return pers_rmse: RMSE for persistence model.
return pers_rmsle: RMSLE for persistence model.
return pers_mae: MAE for persistence model.
return reg_rmse: RMSE for regression model.
return reg_rmsle: RMSLE for regression model.
return reg_mae: MAE for regression model.
"""
df_X = df_test[(df_test['country'] == country)
& (df_test['DayNumber']
>= first_prediction_daynumber-1)][features]
df_y = df_test[(df_test['country'] == country)
& (df_test['DayNumber']
>= first_prediction_daynumber-1)][['7DANewCases',
'True7DANewCases']]
print(country)
print("Persistence model")
pers_rmse, pers_rmsle, pers_mae = persistence_model(df_y)
print("Regression model")
regression_preds, reg_rmse, reg_rmsle, reg_mae = regression_model(
df_X, df_y)
return (regression_preds,
pers_rmse, pers_rmsle, pers_mae,
reg_rmse, reg_rmsle, reg_mae)
def update_errors(pers_rmse, pers_rmsle, pers_mae,
reg_rmse, reg_rmsle, reg_mae):
"""Update error measures for persistence and regression models
Maintains a list of all the error measures and keeps a track of the
number of times the regression model beats the persistence model.
param pers_rmse: RMSE for persistence model.
param pers_rmsle: RMSLE for persistence model.
param pers_mae: MAE for persistence model.
param reg_rmse: RMSE for regression model.
param reg_rmsle: RMSLE for regression model.
param reg_mae: MAE for regression model.
"""
global persistence_rmses
global persistence_rmsles
global persistence_maes
global regression_rmses
global regression_rmsles
global regression_maes
global count_rmse_reg_better
global count_rmsle_reg_better
global count_mae_reg_better
global count_all_reg_better
persistence_rmses.append(pers_rmse)
persistence_rmsles.append(pers_rmsle)
persistence_maes.append(pers_mae)
regression_rmses.append(reg_rmse)
regression_rmsles.append(reg_rmsle)
regression_maes.append(reg_mae)
print("")
if regression_rmses[-1] < persistence_rmses[-1]:
count_rmse_reg_better += 1
if regression_rmsles[-1] < persistence_rmsles[-1]:
count_rmsle_reg_better += 1
if regression_maes[-1] < persistence_maes[-1]:
count_mae_reg_better += 1
if ((regression_rmsles[-1]
< persistence_rmsles[-1]) and
(regression_rmses[-1]
< persistence_rmses[-1]) and
(regression_maes[-1]
< persistence_maes[-1])):
count_all_reg_better += 1
def summarise_results(persistence_rmses, persistence_rmsles, persistence_maes,
regression_rmses, regression_rmsles, regression_maes,
count_rmse_reg_better, count_rmsle_reg_better,
count_mae_reg_better, count_all_reg_better):
"""Summarise performance of regression model compared to persistence
param persistence_rmses: RMSE values for persistence model.
param persistence_rmsles: RMSLE values for persistence model.
param persistence_maes: MAE values for persistence model.
param regression_rmses: RMSE values for regression model.
param regression_rmsles: RMSLE values for regression model.
param regression_maes: MAE values for regression model.
param count_rmse_reg_better: Count when reg model beats pers by RMSE
param count_rmsle_reg_better: Count when reg model beats pers by RMSLE
param count_mae_reg_better: Count when reg model beats pers by MAE
param count_all_reg_better: Count when reg model beats pers by all measures
"""
print("Mean RMSE for persistence model = {}".format(
np.mean(persistence_rmses)))
print("Mean RMSE for regression model = {}".format(
np.mean(regression_rmses)))
print("Mean RMSLE for persistence model = {}".format(
np.mean(persistence_rmsles)))
print("Mean RMSLE for regression model = {}".format(
np.mean(regression_rmsles)))
print("Mean MAE for persistence model = {}".format(
np.mean(persistence_maes)))
print("Mean MAE for regression model = {}".format(
np.mean(regression_maes)))
print("The regression model beat the persistence model "
"in {0} out of {1} tests by RMSE".format(
count_rmse_reg_better, len(persistence_rmses)))
print("The regression model beat the persistence model "
"in {0} out of {1} tests by RMSLE".format(
count_rmsle_reg_better, len(persistence_rmsles)))
print("The regression model beat the persistence model "
"in {0} out of {1} tests by MAE".format(
count_mae_reg_better, len(persistence_maes)))
print("The regression model beat the persistence model "
"in {0} out of {1} tests by RMSE, RMSLE and MAE".format(
count_all_reg_better, len(persistence_rmses)))
Now test one day ahead prediction. Recall that DayNumbers up to and including 30 were used in training, so the first testing DayNumber is 31.
LONG_RANGE_MODEL = False
FIRST_PREDICTION_DAYNUMBER = 31
combined_daily_df_test['ShortRangePrediction'] = 0
persistence_rmses = []
persistence_rmsles = []
persistence_maes = []
regression_rmses = []
regression_rmsles = []
regression_maes = []
count_rmse_reg_better = 0
count_rmsle_reg_better = 0
count_mae_reg_better = 0
count_all_reg_better = 0
for country in test_countries:
(country_preds, pers_rmse, pers_rmsle, pers_mae,
reg_rmse, reg_rmsle, reg_mae) = test_model(combined_daily_df_test,
country,
FIRST_PREDICTION_DAYNUMBER)
print(country)
combined_daily_df_test.loc[(combined_daily_df_test[
'country'] == country) & (combined_daily_df_test['DayNumber']
>= FIRST_PREDICTION_DAYNUMBER),
'ShortRangePrediction'] = country_preds
update_errors(pers_rmse, pers_rmsle, pers_mae,
reg_rmse, reg_rmsle, reg_mae)
summarise_results(persistence_rmses, persistence_rmsles, persistence_maes,
regression_rmses, regression_rmsles, regression_maes,
count_rmse_reg_better, count_rmsle_reg_better,
count_mae_reg_better, count_all_reg_better)
It's useful to visualise how the one day ahead predictions compare to the true values for test countries as a graph.
def graph_predictions(df_test, loc, feature):
"""Graph predicted and true new cases for a country.
param df_test: Dataframe with test data.
param loc: Location (country).
param feature: Prediction model used.
"""
global graphnum
country_groups = df_test.groupby(['country'])
country_grp = country_groups.get_group(loc)
plt.figure(figsize=(18, 5))
plt.plot(country_grp['date'], country_grp[feature],
color='red', label='Predicted Mean New Cases')
plt.plot(country_grp['date'], country_grp['True7DANewCases'],
color='blue', label='True Mean New Cases')
plt.xticks(rotation='vertical')
plt.title("Graph {0} - Graph to Show Predicted and True New Cases "
"(7 Day Rolling Average) from Training Data in {1}"
.format(graphnum, loc))
plt.legend()
plt.show()
graphnum += 1
for country in test_countries:
graph_predictions(combined_daily_df_test[
combined_daily_df_test['DayNumber'] > 30],
country,
'ShortRangePrediction')
The model is performing quite well. Keep in mind that the model is making predictions on just one day, given true input values for all the features. It is not predicting the whole series from a starting point.
Next, plot the relative importance of the features in the trained model. This is useful to infer how the model is working and (possibly) to understand which government measures have the largest influence on case numbers.
def plot_feature_importance(model, X, num_features=30):
"""Plot feature importance based on explained variance as a graph
in descending order of importance.
param model: Trained model.
param X: Dataframe containing training features.
param num_features: Max number of features to plot.
"""
global graphnum
feature_imp = pd.DataFrame({'Value': model.feature_importances_,
'Feature': X.columns})
plt.figure(figsize=(40, 30))
sns.barplot(x="Value", y="Feature",
data=feature_imp.sort_values(by="Value",
ascending=False)[0:num_features])
plt.title("Graph {}: Graph to Show Feature Importance".format(graphnum))
plt.show()
graphnum += 1
Extract the best model found in training.
model = lgbmcv.best_estimator_.steps[1][1]
plot_feature_importance(
model,
combined_daily_df_test[features].drop(columns=exclude_features))
The most important features in this model are lag features related to the number of cases the day before.
Government measures also appear in the plot above but with very small importances.
According to this, the most important government measures are C5, C6 and C2, which are closing public transport, stay at home requirements and workplace closing, respectively.
However, the low importance suggests that these features are not making a significant difference to the predictions this model is making. Their importance is similar to that of country_cat - in other words, the country for which the prediction is being made is almost as significant a factor in the model as the government measures themselves. This suggests that although the model does make reasonable predictions on any given day, it cannot tell us much about the relative importance of the government measures in controlling the virus.
In addition, I also ran some experiments with a slightly different model that included all the lag and difference features for 7DANewCases and 7DADeltaNewCases but none of the government measures. This performed just as well as the model that includes the government measures. This suggests that the model is not learning anything particularly useful from the features based on government measures, so it is not appropriate to draw any conclusions about them from this model.
Up until now, the models have been used to predict just one day ahead. This is OK for training and model development but it is not particularly useful in practice. In this section, the trained model will be used to predict several days ahead.
def update_long_range_model(X, y_pred, prediction_date):
""" Update model input features that change as predictions are made.
param X: Dataframe of data used in the prediction process.
param y_pred: A prediction that will also later be used as a model input.
param prediction_date: Date the prediction applies to.
"""
# Update 7DANewCases
X.loc[X['date'] == prediction_date, '7DANewCases'] = y_pred
# Update the 7DANewCases lag features
for i in range(MIN_LAG_CASES, MAX_LAG_CASES):
lagfeat = '7DANewCases-{}'.format(str(i))
if (prediction_date + timedelta(i)) <= LAST_DATE:
X.loc[X['date'] == (prediction_date + timedelta(i)),
lagfeat] = X[X['date']
== prediction_date]['7DANewCases'].values
# Update 7DADeltaNewCases
X.loc[X['date'] == prediction_date, '7DADeltaNewCases'] = (X.loc[
X['date'] == prediction_date, '7DANewCases'].values
- X.loc[X['date'] == (prediction_date - timedelta(1)),
'7DANewCases'].values)
# Update the 7DADeltaNewCases lag features
for i in range(MIN_LAG_CASES, MAX_LAG_CASES):
lagfeat = '7DADeltaNewCases-{}'.format(str(i))
if (prediction_date + timedelta(i)) <= LAST_DATE:
X.loc[X['date'] == (prediction_date + timedelta(i)),
lagfeat] = X[X['date']
== prediction_date]['7DADeltaNewCases'].values
Now perform the long range prediction, starting from DayNumber 31.
LONG_RANGE_MODEL = True
FIRST_PREDICTION_DAYNUMBER = 31
combined_daily_df_test['LongRangePrediction'] = 0
persistence_rmses = []
persistence_rmsles = []
persistence_maes = []
regression_rmses = []
regression_rmsles = []
regression_maes = []
count_rmse_reg_better = 0
count_rmsle_reg_better = 0
count_mae_reg_better = 0
count_all_reg_better = 0
# 7DANewCases, 7DADeltaNewCases and their lag features must be calculated
# from predicted values, so remove the known values from the input features.
# They are still stored in True7DANewCases so nothing is lost.
for i, f in enumerate(['7DANewCases']
+ find_lag_features(['7DANewCases'],
MAX_LAG_CASES,
MIN_LAG_CASES)):
combined_daily_df_test.loc[combined_daily_df_test['DayNumber']
>= i+FIRST_PREDICTION_DAYNUMBER, f] = 0
for i, f in enumerate(['7DADeltaNewCases']
+ find_lag_features(['7DADeltaNewCases'],
MAX_LAG_CASES,
MIN_LAG_CASES)):
combined_daily_df_test.loc[combined_daily_df_test['DayNumber']
>= i+FIRST_PREDICTION_DAYNUMBER, f] = 0
for country in test_countries:
(country_preds, pers_rmse, pers_rmsle, pers_mae,
reg_rmse, reg_rmsle, reg_mae) = test_model(combined_daily_df_test,
country,
FIRST_PREDICTION_DAYNUMBER)
combined_daily_df_test.loc[(combined_daily_df_test[
'country'] == country) & (combined_daily_df_test['DayNumber']
>= FIRST_PREDICTION_DAYNUMBER),
'LongRangePrediction'] = country_preds
update_errors(pers_rmse, pers_rmsle, pers_mae,
reg_rmse, reg_rmsle, reg_mae)
summarise_results(persistence_rmses, persistence_rmsles, persistence_maes,
regression_rmses, regression_rmsles, regression_maes,
count_rmse_reg_better, count_rmsle_reg_better,
count_mae_reg_better, count_all_reg_better)
It's clear the model does not produce good predictions over a long period. The small errors in 7DANewCases soon make the model unstable.
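This feedback instability can be illustrated with a toy calculation, assuming an arbitrary 5% overestimate at each prediction step (the 5% figure is a stand-in, not a measured error rate):

```python
# Suppose the true series is flat at 100 and each one-step prediction
# overestimates its input by 5%. Because each step builds on the
# previous prediction, the error compounds rather than averaging out.
value = 100.0
step_error = 1.05  # arbitrary 5% per-step overestimate

errors = []
for day in range(1, 11):
    value *= step_error           # feed the prediction back in as input
    errors.append(value - 100.0)  # deviation from the true, flat series

# After 1 day the error is 5; after 10 days it has compounded to ~63
print(round(errors[0], 2), round(errors[-1], 2))  # → 5.0 62.89
```

A small, tolerable one-step error therefore becomes a dominant error within about a week of self-fed predictions, which matches the behaviour seen above.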
Nevertheless, it may work better over a shorter period. Plot graphs of the predicted and true values over a shorter horizon (DayNumber 30 to 50):
for country in test_countries:
graph_predictions(combined_daily_df_test[
(combined_daily_df_test['DayNumber'] > 30)
& (combined_daily_df_test['DayNumber'] < 50)],
country,
'LongRangePrediction')
The model is sometimes making reasonable predictions but in many cases the predictions soon diverge from the true values.
Now compare the prediction errors to the persistence model for a prediction 4 days into the future (DayNumber 31 to 34):
def calculate_errors(df_test, pred_feature):
""" Calculate error metrics for a predicted result in a data frame.
Allows calculation of error metrics for any predicted feature in a data
frame with an arbitrary date range.
param df_test: Dataframe.
param pred_feature: A predicted feature in df_test to calculate errors for.
"""
preds = df_test[pred_feature].values
trues = df_test['True7DANewCases'].values
rmsle_error = rmsle(trues, preds)
rmse_error = rmse(trues, preds)
mae_error = mean_absolute_error(trues, preds)
print("RMSE for {} model = {:.3f}".format(pred_feature, rmse_error))
print("RMSLE for {} model = {:.5f}".format(pred_feature, rmsle_error))
print("MAE for {} model = {:.3f}".format(pred_feature, mae_error))
return(rmse_error, rmsle_error, mae_error)
persistence_rmses = []
persistence_rmsles = []
persistence_maes = []
regression_rmses = []
regression_rmsles = []
regression_maes = []
count_rmse_reg_better = 0
count_rmsle_reg_better = 0
count_mae_reg_better = 0
count_all_reg_better = 0
for country in test_countries:
print(country)
df_X = combined_daily_df_test[(combined_daily_df_test['country'] == country)
& (combined_daily_df_test['DayNumber'] > 30)
& (combined_daily_df_test['DayNumber'] < 35)]
df_y = combined_daily_df_test[(combined_daily_df_test['country'] == country)
& (combined_daily_df_test['DayNumber'] >= 30)
& (combined_daily_df_test['DayNumber'] < 35)][['7DANewCases',
'True7DANewCases']]
pers_rmse, pers_rmsle, pers_mae = persistence_model(df_y)
reg_rmse, reg_rmsle, reg_mae = calculate_errors(df_X, 'LongRangePrediction')
update_errors(pers_rmse, pers_rmsle, pers_mae, reg_rmse, reg_rmsle, reg_mae)
summarise_results(persistence_rmses, persistence_rmsles, persistence_maes,
regression_rmses, regression_rmsles, regression_maes,
count_rmse_reg_better, count_rmsle_reg_better,
count_mae_reg_better, count_all_reg_better)
Even when the model is predicting just four days into the future, the persistence model is better than the regression model.
After trying a few values, it's clear that the regression model beats the persistence model more often than not only when it predicts no more than two days into the future. The "short range" model that predicted just one day ahead seemed to work quite well, but even those predictions had small errors in them. In the long range model, the predictions become inputs to the model, and as shown in section 6.10 these are the most important features in the model. As a result, the small errors in the predictions are fed back into the model and quickly lead to large errors after just a few days.
The short range model that predicts 7DANewCases on just one day, given all the ground truth input values, performs quite well compared to the persistence model. However, the most important features by far are the lag and difference features based on 7DANewCases. The government measures have a level of importance similar to the country in the model. Removing the government measures entirely does not significantly impact the accuracy of the model. This suggests that it is not worhtwhile attempting to infer anything about the relative importance of the government measures.
In addition, although the short range model performs well, the accuracy of its predictions falls quickly when it is used as the basis for a long range prediction. In this version, the model uses its own predictions of 7DANewCases as inputs and attempts to make predictions several days into the future. But the small errors in the predictions of 7DANewCases quickly grow, and less than 5 days into the future the model is no longer making useful predictions.
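The error-compounding behaviour described above can be demonstrated with a toy recursive forecaster: a model is fitted on a single lag feature, then rolled forward by feeding each prediction back in as the next input, so any one-step error is carried into every subsequent step. The data and model here are illustrative, not the notebook's LightGBM setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy series: near-exponential growth that a linear lag model fits
# only approximately, so each one-step prediction has a small error.
rng = np.random.default_rng(0)
series = 100 * 1.05 ** np.arange(40) + rng.normal(0, 5, 40)

# Fit y[t] from the single lag feature y[t-1] on the first 30 points.
X_train = series[:29].reshape(-1, 1)
y_train = series[1:30]
model = LinearRegression().fit(X_train, y_train)

# Recursive long-range forecast: each prediction becomes the next
# input, so its error propagates into every later step.
preds, last = [], series[29]
for _ in range(10):
    last = model.predict(np.array([[last]]))[0]
    preds.append(last)

errors = np.abs(np.array(preds) - series[30:40])
```

Plotting `errors` against the forecast horizon typically shows them growing with each step, which mirrors the behaviour of the long range model here.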
Overall, the method used here has some very limited success as a prediction model, but it is not a useful way to explore the relative effectiveness of government responses to COVID-19.
print("Completed execution!")
This project combined several data sets about COVID-19 and the way governments around the world have responded to it, to address the specific questions posed in section 1.3 about the most effective ways to minimise virus cases and fatalities.
The main conclusions are:
Comparing the government responses to the spread of the virus over time shows that government lockdowns often precede a fall in cases, but that the relationship between cases, fatalities and government measures is not a simple one. The level of excess fatalities in a country does have some relationship to both the time when lockdown took place and the time when testing started (the earlier these happen, the greater the chance of a lower level of excess fatalities), but it's clear these are not the only factors that matter. There is a complex set of interacting factors, all of which influence the spread of the virus and the fatality rate.
The K-means clustering model tentatively suggests that the following measures are more strongly associated with low excess deaths:
- C4_Restrictions on gatherings
- E1_Income support
- time_to_lockdown (a low value means earlier lockdown)
- first_week_of_testing (a low value means starting testing sooner)

compared to the measures:

- C5_Close public transport
- C5_Flag
- C6_Stay at home requirements
- H3_Contact tracing

The clustering model gives interesting results but it is limited by the lack of countries with excess deaths data.
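The clustering comparison behind these findings can be sketched roughly as follows: standardise the government-measure features, fit K-means, then compare excess deaths and cluster centres across the clusters. The data below is synthetic and the column set is a small illustrative subset, not the notebook's full feature frame.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative frame: one row per country, measures plus excess deaths.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'C4_Restrictions on gatherings': rng.integers(0, 5, 30),
    'E1_Income support': rng.integers(0, 3, 30),
    'time_to_lockdown': rng.integers(0, 60, 30),
    'first_week_of_testing': rng.integers(0, 20, 30),
    'excess_deaths': rng.normal(0, 1, 30),
})

# Cluster on the measures only; excess deaths is held out for comparison.
features = df.drop(columns='excess_deaths')
X = StandardScaler().fit_transform(features)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
df['cluster'] = km.labels_

# Mean excess deaths per cluster, and the cluster centres (in scaled
# units) showing which measures separate the low-excess-death clusters.
summary = df.groupby('cluster')['excess_deaths'].mean()
centres = pd.DataFrame(km.cluster_centers_, columns=features.columns)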
There are many other aspects of the data that would be very interesting to study.
The relationship between fatality rates and factors such as the prevalence of smoking in a country, or obesity, age, poverty, number of hospital beds, and so on could be explored with visualisations and inferential statistics.
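A simple starting point for that exploration would be pairwise correlations between the fatality rate and each country-level factor, before any visualisation or formal inference. The frame and column names below are hypothetical placeholders for the data sets suggested above.

```python
import numpy as np
import pandas as pd

# Illustrative country-level frame; the column names are hypothetical.
rng = np.random.default_rng(1)
n = 50
country_df = pd.DataFrame({
    'fatality_rate': rng.uniform(0.5, 5.0, n),
    'smoking_prevalence': rng.uniform(10, 40, n),
    'obesity_rate': rng.uniform(5, 35, n),
    'median_age': rng.uniform(20, 48, n),
    'hospital_beds_per_1000': rng.uniform(0.5, 8.0, n),
})

# Correlation of each factor with the fatality rate: a first look at
# which factors might be worth testing with inferential statistics.
corrs = country_df.corr()['fatality_rate'].drop('fatality_rate')
```

Promising factors could then be examined with scatter plots and hypothesis tests, bearing in mind that correlation across countries says nothing about causation.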
The clustering analysis could be performed with the number of reported fatalities, rather than excess fatalities, to give a larger data set.
In addition to the government measures, it would be interesting to explore population compliance levels inferred from sources such as Google's mobility data [15].
Although the regression model used here was not useful for learning about the importance of government measures, there may be other models that are more useful for this. If such a model could be created, then the importance of the features based on government measures could be examined and used to infer which are most important. A better way to examine feature importance would be to use Shapley Additive Explanations (SHAP) [20].
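SHAP values (via the `shap` package) are the approach suggested above; a lighter, model-agnostic alternative already available in scikit-learn is permutation importance, which is a different technique but probes the same question. The sketch below uses synthetic data and a random forest as a stand-in for the fitted regressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data where only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Permutation importance: shuffle one column at a time and measure the
# drop in score; a large drop means the model relies on that feature.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
importances = result.importances_mean
```

Unlike a tree model's built-in split-based importances, this measures what the model actually uses on held-out-style perturbations, which makes it a reasonable sanity check alongside SHAP.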
While I was finishing this project, I read a published study that aims to find the relative effectiveness of government measures to mitigate COVID-19. It uses different methods than this project and it is an interesting read. [18]
Publication: Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 2020;20(5):533-534.